Abstract: Record linkage concerns identifying semantically
equivalent records in databases. Blocking methods are employed 
to avoid the cost of full pairwise similarity comparisons on n 
records. In a seminal work, Hernandez and Stolfo proposed 
the Sorted Neighborhood blocking method. Several empirical 
variants have been proposed in recent years. In this paper, we 
investigate the complexity of the Sorted Neighborhood procedure 
on which the variants are built. We show that achieving maximum 
performance on the Sorted Neighborhood procedure entails 
solving a sub-problem, which is shown to be NP-complete by 
reducing from the Travelling Salesman Problem. We also show 
that the sub-problem can occur in the traditional blocking 
method. Finally, we draw on recent developments concerning 
approximate Travelling Salesman solutions to define and analyze 
three approximation algorithms. 

Index Terms: Blocking, Record Linkage, Sorted Neighborhood,
Data Matching, Complexity

I. Introduction 

Record linkage concerns identifying pairs of records that 
refer to the same underlying entity but are syntactically 
disparate. The problem goes by multiple names in the database 
community, examples being entity resolution [1], instance
matching [2], co-reference resolution [3], hardening soft
databases [4] and the merge-purge problem [5].

Given n records and a sophisticated similarity function g
that determines whether two records are equivalent, a naive
record linkage application would run in time $\Theta(t(g) n^2)$, where
t(g) is the run-time of g. Scalability calls for a two-step
approach [6]. First, blocking generates a candidate set of
promising pairs that have the potential to be duplicates. The
vast majority of pairs are discarded in this step, leading to
significant savings [7]. Second, g is applied only to the pairs
in the candidate set. Because of the need to limit the complexity
of record linkage to near linear-time, blocking has emerged
as a research area in its own right [7], [8].

Sorted Neighborhood is a popular blocking method published
originally by Hernandez and Stolfo [5]. The method was
found to have excellent empirical performance 0, 0. In the 
past two decades, numerous empirical variants have been published
[10], [11], including an application to XML duplicate
detection [12]. Parallel implementations also continue to be
researched. For example, a MapReduce-based implementation
of Sorted Neighborhood was published in 2012 [13], [14]. The
evidence indicates that Sorted Neighborhood remains topical 
in the data matching community. 

Table I, used as a running example throughout the paper,
illustrates the original method. First, a blocking key is defined
and applied to each record to generate a blocking key value
(BKV) for the record. In Table I, the blocking key extracts
and concatenates the initial characters of the attribute tokens in
the record. Each record’s BKV is also listed in Table I. Next,
the BKVs are used as sorting keys. Finally, a window of constant
size $w \geq 2$ is slid over the sorted records from beginning to end,
with records sharing a window paired, and each pair added to the
candidate set. With w = 2, for example, record pair¹ $(r_1, r_2)$ would
be added to the (initially empty) candidate set. Sliding the
window forward, record pair $(r_2, r_3)$ is added. The process
continues, with record pair $(r_6, r_7)$ being the final addition to
the set.
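
The procedure just described can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names (`blocking_key`, `sorted_neighborhood`) are ours, ties between equal BKVs are resolved by input order (Python's sort is stable), and the window is realized by pairing every two records that lie at most w - 1 positions apart in the sorted list.

```python
from typing import List, Tuple

# Records from Table I: (ID, first name, last name, zip).
RECORDS = [
    (1, "Cathy", "Ransom", "77111"),
    (2, "Catherine", "Ridley", "77093"),
    (3, "Cathy", "Ridley", "77093"),
    (4, "John", "Rogers", "78751"),
    (5, "J.", "Rogers", "78732"),
    (6, "John", "Ridley", "77093"),
    (7, "John", "Ridley Sr.", "77093"),
]


def blocking_key(record) -> str:
    """Concatenate the initial character of every token of every attribute."""
    _, *attributes = record
    return "".join(token[0] for attr in attributes for token in attr.split())


def sorted_neighborhood(records, w: int) -> List[Tuple[int, int]]:
    """Sort by BKV, then pair the IDs of records that share a window of size w."""
    ordered = sorted(records, key=blocking_key)  # stable: ties keep input order
    pairs = []
    for i in range(len(ordered)):
        # Records at most w - 1 positions apart share at least one window.
        for j in range(i + 1, min(i + w, len(ordered))):
            pairs.append((ordered[i][0], ordered[j][0]))
    return pairs


candidate_set = sorted_neighborhood(RECORDS, w=2)
```

With w = 2 the sketch pairs each record with its immediate successor in the sorted list, which matches the walk-through above.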

We define the w-ordering problem as the problem of ordering
records that have the same BKV, as in the case of records
$r_1$, $r_2$ and $r_3$. The definition is formally given as Definition 3 in
Section IV. Suppose that a polynomial-time scoring heuristic
f is provided, such that f returns a real-valued similarity
score for a given record pair. The goal is to order records
so as to maximize the score of the resulting candidate set
for given f and w, but without violating the sorting order
imposed by the BKVs. Assuming w = 2 and a heuristic based
on first and last name similarity, the candidate set score is
maximized for the ordering in Table I. Table I is then said
to be a maximum-score 2-ordering for the corresponding set
of records $\{r_1, r_2, \ldots, r_7\}$. As an example of a 2-ordering
that is not maximum-score, consider reversing the positions of
$r_5$ and $r_6$. The reversal would cause two potentially
duplicate pairs, $(r_4, r_5)$ and $(r_6, r_7)$, which previously
contributed their scores, to be left out of the candidate set.
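
The effect of reversing records 5 and 6 can be checked mechanically. The short sketch below is illustrative only (the helper name `adjacent_pairs` is ours); it compares the w = 2 candidate sets of the two orderings without assuming any particular scoring heuristic.

```python
def adjacent_pairs(order):
    """Candidate set produced by a window of size w = 2 slid over `order`."""
    return {frozenset(pair) for pair in zip(order, order[1:])}


table_order = [1, 2, 3, 4, 5, 6, 7]    # ordering shown in Table I
swapped_order = [1, 2, 3, 4, 6, 5, 7]  # records 5 and 6 reversed

# Pairs that drop out of, and enter, the candidate set after the reversal.
lost = adjacent_pairs(table_order) - adjacent_pairs(swapped_order)
gained = adjacent_pairs(swapped_order) - adjacent_pairs(table_order)
```

The pairs (4, 5) and (6, 7) are lost, while (4, 6) and (5, 7) are gained; whether the total score drops depends on the heuristic f.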


Herein it is shown that, in the general case, achieving
a maximum-score w-ordering for a set of records is NP-complete.
A Karp reduction from the NP-complete Travelling
Salesman Problem (TSP) is presented (Theorem 2, Section IV)
[15]. To the best of our knowledge, w-ordering has not been
studied in previous literature on blocking methods. A possible
explanation is that practitioners assumed a random ordering
with a large window size w to yield a good empirical solution,
especially for small datasets. As Hernandez and Stolfo found
in their own experiments, such an approach is outperformed by
a multi-pass Sorted Neighborhood approach with inexpensive
blocking keys and small window sizes [5]. In particular, 3
passes, the minimum window size of 2 and transitive closure
in the second record linkage step were found to achieve a good
balance of run-time and accuracy on a test database.²

Given these findings, and given that the run-time of a multi-pass
approach is proportional to both w and the number of
runs [5], we argue that refining Sorted Neighborhood further,
even for w = 2, is an important problem. A review of
TSP literature shows that improved approximation bounds
for the max tour-TSP variant continue to be proposed [16].
By reducing maximum-score 2-ordering to max TSP, we
devise three polynomial-time approximation algorithms for
maximum-score 2-ordering. The goal is to improve theoretical
Sorted Neighborhood (SN) performance by presenting tractable,
bounded approximations for maximum-score 2-ordering.


1 $r_i$ refers to the record with ID i, $i \in \{1, 2, \ldots, 7\}$

2 As evidence, we refer the reader to Figure 4 and the conclusion in the
original paper [5]



TABLE I
Records sorted using blocking key values (BKVs)

ID  First Name  Last Name   Zip    BKV
1   Cathy       Ransom      77111  CR7
2   Catherine   Ridley      77093  CR7
3   Cathy       Ridley      77093  CR7
4   John        Rogers      78751  JR7
5   J.          Rogers      78732  JR7
6   John        Ridley      77093  JR7
7   John        Ridley Sr.  77093  JRS7


Two of the three proposed algorithms present multi-pass
Sorted Neighborhood but with approximate solutions
to maximum-score 2-ordering integrated into the procedure.
A third algorithm presents a similar solution for traditional
blocking [7]
in the MapReduce paradigm G3- We present 
this case for two reasons. First, the case shows that solutions 
to the ordering problem need not be restricted to Sorted 
Neighborhood, but potentially apply to other popular blocking 
methods as well. Secondly, it demonstrates that 2-ordering 
solutions do not necessitate serial architectures. 

The three algorithms invoke a max TSP subroutine as a
black box, and their qualitative performance is shown to
closely mirror that of the invoked subroutine. This implies
that further improvements in max TSP bounds lead directly to
similar improvements in the algorithms. Using current TSP
results, the bound is 61/81 for arbitrary non-negative edge
weight functions [16]. We devise an appropriate reduction and
show that two of our three algorithms have exactly this bound.

We summarize run-time and quality results for all three
algorithms for both the uniform and Zipf distributions [17] of
BKVs in a principled fashion. Both distributions are known
to occur commonly in practice and were recently used in a
related analytical work on blocking [7].

The outline of the paper is as follows. Section II describes
related work, and Section III describes preliminaries. Section
IV defines the w-ordering problem and Section V presents
approximation algorithms. Section VI lists two conjectures and
concludes the work.


II. Related Work 


A. Record Linkage 

As a problem first noted over five decades ago by Newcombe
et al., record linkage has been the focus of efforts
in the structured, semistructured and unstructured data
communities [6]. It is common to separate efforts in the unstructured
data community, where the problem is commonly called
co-reference or anaphora resolution, from those in the
structured and semistructured data communities [6]. In the late
1960s, Fellegi and Sunter placed record linkage in a Bayesian
framework [19], and the model continues to guide state-of-the-art
research, which is also influenced heavily by contemporary
research in the AI community [20]. For example, rule-based
approaches were popular during the 1980s and 1990s [21],
but machine learning methods have gained prominence in the
last decade [22]. Three recent surveys are by Elmagarmid et
al. [6], Kopcke and Rahm [23] and Winkler [20]. A generic,
powerful framework that addresses some of the challenges
of modern record linkage, both in theory and practice, is
Swoosh [1]. Several open-source toolkits implementing record
linkage techniques are available to the practitioner; we list
SecondString [24] and Febrl [25] as good examples.

Some alternate record linkage models have recently become
popular, including collective record linkage [26], [27] and
iterative record linkage [28]. The problem is also important
in the linked data and Semantic Web community [29], owing
to documented growth of linked open data (LOD)³. Many
techniques originally developed for relational databases are
being adapted for LOD, including rule-based and machine
learning approaches [30], [31]. A full survey on Semantic Web
record linkage systems was provided by Ferrara et al. [2].
Other applications of record linkage include data integration
[32], knowledge graph identification [33], and biomedical
linkage [25].

Given the expense of record linkage, blocking was recognized
as an important preprocessing step even when the
problem first emerged [18]. The traditional blocking method,
which is similar to hashing, continues to be popular [7]. The
Sorted Neighborhood method was proposed in the 1990s and,
as noted in Section I, continues to be used and adapted due to
its impressive empirical performance [5]. Christen compares
important blocking methods in his survey [7], in which he also
verifies the good performance of both Sorted Neighborhood
and traditional blocking. We note that, while several empirical
variants of Sorted Neighborhood exist, all of them rely on the
fundamental procedure that was first described in the original
paper [5]. The procedure will be formally characterized in
Section IV.

Finally, parallel and distributed techniques for record linkage
are an active area of research [34], [35]. MapReduce has
emerged as an important paradigm, owing to its documented 
advantages; we refer the reader to the original paper for an 
excellent introduction ID- 

For a synthesis of the multiple threads of record linkage 
research, we refer the reader to the recently published data 
matching text by Christen [8].


B. Travelling Salesman Problem (TSP) 

Complexity proofs in this paper mainly rely on the Travelling
Salesman Problem (TSP), which is among the oldest and
best studied NP-complete problems [36]. The classic version
of the problem, proposed at least as early as 1954 [37], is
min tour-TSP. Specifically, assume a weighted, undirected and
complete graph G = (V, E, W) with arbitrary edge weights.
The problem is to locate a minimum-weight Hamiltonian
cycle. Even with weights set to either 0 or 1, min tour-TSP
was shown to be NP-complete, by virtue of a Karp reduction
from the Hamiltonian cycle problem [36]. TSP for directed
graphs (also known as asymmetric TSP) was also shown to be
NP-complete [38]. In this paper, we only consider symmetric
variants and undirected graphs.

Many variants of TSP have since been shown to be NP-complete,
including for weight functions that are metric [39] or
even Euclidean [40]. Two variants of importance herein are the
min path-TSP and max tour-TSP variants with arbitrary non-negative
weights, both of which will be described in Section III-B.

We note that not all TSP variants are equal from an
approximability perspective. Define the weight⁴ of a tour-TSP
solution to be the sum of weights of all edges in the tour;
the weight of a path-TSP solution can be similarly defined.


3 linkeddata.org

4 We uniformly use the word weight instead of cost (or score) since both
min and max optimization problems are considered in this paper

Let the weight of an optimal min tour-TSP solution be $\phi^*$. A
polynomial-time $\rho$-approximation algorithm is an algorithm
that is guaranteed to find a solution with weight at most
(or at least, for max variants) $\rho\phi^*$, where $\rho$ is a constant
called the approximation ratio [41]. Note that
$\rho > 1$ for min variants and $\rho < 1$ for max variants. It is
known that for min tour-TSP with arbitrary weights, a
$\rho$-approximation algorithm does not exist unless P = NP.
However, a $\rho$-approximation algorithm exists for max tour-TSP
with arbitrary non-negative weights [16], and also for
min tour-TSP if the weights are metric [39].
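
For reference, the two guarantees can be written side by side. The block below is only a restatement of the definition above; the symbol $\phi$ for the weight of the returned solution is our notation, and the numerical example uses the 61/81 ratio quoted in Section I for max tour-TSP.

```latex
% \phi   : weight of the solution returned by the algorithm (our notation)
% \phi^* : weight of an optimal solution
\[
  \text{min variants: } \phi \le \rho\,\phi^{*} \ (\rho > 1),
  \qquad
  \text{max variants: } \phi \ge \rho\,\phi^{*} \ (\rho < 1).
\]
% Example (max tour-TSP, \rho = 61/81):
\[
  \phi \;\ge\; \tfrac{61}{81}\,\phi^{*} \;\approx\; 0.753\,\phi^{*}.
\]
```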

The first approximation scheme proposed for metric min
tour-TSP was by Christofides, with $\rho = 3/2$ [39]. The difficulty
of TSP is attested to by the fact that this approximation
ratio is yet to be improved. Fortunately, approximation ratios
continue to be updated for max tour-TSP, as described in
Section III-B. For a full discussion of TSP, we refer the reader
to the seminal book on the subject by Reinelt [42]. In their text,
Cormen et al. provide a thorough introduction to the general
topics of NP-completeness and approximations.


III. Preliminaries 


A. Problem Setting 

The relational data model is assumed in this paper, with
a brief formalism reproduced for completeness. A relational
database schema S' is a finite set of relation names. Each
individual name $R' \in S'$ is associated with a set of attributes.
An instance S of schema S' assigns to each $R' \in S'$ a finite
set $R \in S$ of records. For each attribute in the attribute set of
$R' \in S'$, a record in $R \in S$ either has an attribute value or
NULL, which is a reserved keyword used to indicate missing
or non-existent attribute values.

In this paper, we assume that S = {R} and S' = {R'}. In
other words, a single instance R is assumed, with name R' and
m > 1 attributes. A single schema is a standard assumption
in much of existing record linkage literature [6]. The original
Sorted Neighborhood paper additionally assumed only a single
instance [5]. Typically, if more than one instance is expected,
possibly with different schemas, a schema integration step
must be incorporated into the pipeline [43].

We also assume that the number of attributes (or columns)
is much smaller than the number of records (or rows), and that
‘processing’ a record takes constant time. Three real-world
examples of such processing are counting tokens in a record,
generating token initials (as in Table I), and generating token
n-grams. Both assumptions above are standard in the blocking
community when analyzing blocking methods [7].


B. Travelling Salesman Problem (TSP) variants 

In Section II-B, we noted that the symmetric TSP variants
take as input a complete, undirected and weighted graph G =
(V, E, W) without self-loops. Define a Hamiltonian path as a
path that includes every vertex in the graph exactly once [36].

Define the problem of finding a minimum-weight Hamiltonian
path in G as the min path-TSP [15]. The decision version
of the problem additionally accepts an integer k, and
needs to determine if a Hamiltonian path with cost at most
k exists. Three versions of the problem have been studied,
and all are NP-complete [44]. In the first version, which is of
primary concern in this paper, path endpoints are not specified
and the algorithm returns True if there exists any Hamiltonian
path with cost at most k. In the other two versions, one or both
endpoints are respectively given. These versions are therefore
more constrained than the first version.
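
To make the first decision version concrete, the following brute-force sketch enumerates every vertex ordering of a small complete graph. It is illustrative only: the function names and the four-vertex example graph are our own assumptions, and the factorial running time is exactly what the NP-completeness results suggest cannot be avoided in general.

```python
from itertools import permutations


def path_weight(path, weight):
    """Sum of edge weights along consecutive vertices of `path`."""
    return sum(weight[frozenset((u, v))] for u, v in zip(path, path[1:]))


def min_path_tsp_decision(vertices, weight, k):
    """True iff some Hamiltonian path (endpoints unspecified) has weight <= k.

    Brute force over all vertex orderings; feasible only for tiny graphs.
    """
    return any(path_weight(p, weight) <= k for p in permutations(vertices))


# Hypothetical complete graph on four vertices (weights chosen for illustration).
VERTICES = ["a", "b", "c", "d"]
WEIGHT = {
    frozenset(("a", "b")): 1,
    frozenset(("b", "c")): 2,
    frozenset(("a", "c")): 5,
    frozenset(("a", "d")): 4,
    frozenset(("b", "d")): 3,
    frozenset(("c", "d")): 1,
}
```

In this example the lightest Hamiltonian path is a, b, c, d with weight 4, so the decision procedure answers True for k = 4 and False for k = 3.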


Hoogeveen adapted Christofides’ cubic 3/2-approximation
algorithm for all three versions, and showed that the 3/2 bound
held for the first two versions [44]. For the third version (both
endpoints specified), he showed a 5/3 bound.⁵ In 2012, An et
al. improved the 5/3 bound to $(1 + \sqrt{5})/2$ [45]. In the most recent
work we are aware of, Sebő improved this bound even further
to 8/5 [46]. Hoogeveen’s original conclusion on the difficulty
of the problem (compared to the tour problem) still stands,
since all these bounds are greater than 3/2, which has yet to
be improved, to the best of our knowledge.

The max version of tour-TSP is similar to min tour-TSP,
except that the problem is to locate a maximum-weight
Hamiltonian circuit, with the weight function assumed to be
non-negative [47]. Despite being seemingly similar, max tour-TSP
turns out to be easier to approximate than min tour-TSP;
the currently best known deterministic algorithm runs in cubic
time (in the number of vertices) and has approximation ratio
61/81 for an arbitrary non-negative weight function [16].

More importantly, we note that max tour-TSP and its variants
continue to invite improvements [48], [16], and that the
weight function does not have to be metric for a polynomial-time
approximation algorithm with constant approximation
ratio to be devised.
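
As a small illustration of the max tour-TSP objective, the sketch below computes an optimal tour by brute force and compares it with a naive greedy tour. Both functions and the example graph are our own assumptions; in particular, the greedy rule is not the 61/81-approximation algorithm cited above and carries no guarantee.

```python
from itertools import permutations


def tour_weight(tour, weight):
    """Weight of the cycle that visits `tour` in order and returns to the start."""
    closed = list(tour) + [tour[0]]
    return sum(weight[frozenset((u, v))] for u, v in zip(closed, closed[1:]))


def max_tour_brute_force(vertices, weight):
    """Optimal max tour-TSP weight; fixes the first vertex and permutes the rest."""
    first, *rest = vertices
    return max(tour_weight((first, *p), weight) for p in permutations(rest))


def greedy_tour(vertices, weight):
    """Naive heuristic (no guarantee): always move to the heaviest unvisited edge."""
    tour = [vertices[0]]
    remaining = set(vertices[1:])
    while remaining:
        nxt = max(sorted(remaining), key=lambda v: weight[frozenset((tour[-1], v))])
        tour.append(nxt)
        remaining.remove(nxt)
    return tour


# Hypothetical complete graph with non-negative weights.
VERTICES = ["a", "b", "c", "d"]
WEIGHT = {
    frozenset(("a", "b")): 1, frozenset(("b", "c")): 2, frozenset(("a", "c")): 5,
    frozenset(("a", "d")): 4, frozenset(("b", "d")): 3, frozenset(("c", "d")): 1,
}

optimal = max_tour_brute_force(VERTICES, WEIGHT)
heuristic = tour_weight(greedy_tour(VERTICES, WEIGHT), WEIGHT)
ratio = heuristic / optimal  # empirical ratio on this instance only
```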

Unlike min TSP, max TSP approximations are appropriate
only for the tour versions. For our purposes, we will use the
first version of min path-TSP to show NP-completeness of
maximum-score 2-ordering in Section IV, while approximate
solutions to max tour-TSP will be used for devising approximate
solutions to maximum-score 2-ordering in Section V.


IV. The w-ordering problem 
A. Sorted Neighborhood 

To begin, Sorted Neighborhood assumes a blocking key to
be given. For clarity, the functional definition of a blocking
key is provided below:

Definition 1. Given a set R of records and an alphabet $\Sigma$,
define a blocking key to be a function $b : R \to \Sigma^*$.

Let b(r) (for some $r \in R$) be denoted as the blocking key
value (BKV) of r. Given a finite set of records R and blocking
key b, let Y be denoted as the set of BKVs for R. Note that
$|Y| \leq |R|$. The inequality is strict if more than one record has
the same BKV.
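
The relationship between R, Y and the inequality above can be seen on the running example. The sketch below is illustrative (the names `BKV_BY_ID` and `group_into_blocks` are ours) and starts from the BKVs already listed in Table I.

```python
from collections import defaultdict

# BKVs of the seven records in Table I, keyed by record ID.
BKV_BY_ID = {1: "CR7", 2: "CR7", 3: "CR7", 4: "JR7", 5: "JR7", 6: "JR7", 7: "JRS7"}


def group_into_blocks(bkv_by_id):
    """Map each distinct BKV y to the list of record IDs whose BKV is y."""
    blocks = defaultdict(list)
    for record_id, bkv in bkv_by_id.items():
        blocks[bkv].append(record_id)
    return dict(blocks)


blocks = group_into_blocks(BKV_BY_ID)
distinct_bkvs = sorted(blocks)  # the set Y, in ascending (lexicographic) order
```

Here |Y| = 3 while |R| = 7; the inequality is strict because several records share a BKV.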

Assume a total order on $\Sigma^*$ and, by consequence, on Y.
In keeping with the earlier assumption in Section III-A that
processing a record is a constant-time operation, BKV computation
should not be an expensive operation [5], [7].

Given the run-time of b to be t(b) per record, an SN
algorithm would first generate Y in time $O(|R| t(b))$, and then
convert R into a sorted list, $R_l$, using the BKVs in Y as
the sorting keys. Assuming a comparison sort, the step would
take $O(|R| \log |R|)$ and was found to be the most expensive SN
step in practice [5], [7]. This also implies that t(b) is usually
$o(\log |R|)$. Henceforth, we consider t(b) to be O(1). In Section
V, we lift this assumption and consider arbitrarily expensive
blocking keys when we present and analyze approximation
algorithms.

In the merge step, the w-window is slid from the first record
in $R_l$ to the last record in exactly $|R_l| - w + 1$ sliding steps.
In each step before the last, pair the first record in the window with
all other records sharing the window, to add exactly $w - 1$
unique pairs to the candidate set $\Gamma$. In the final sliding step,
pair every record in the window with every other record to
generate $w(w - 1)/2$ pairs, ensuring that all records sharing
a window are paired and added to $\Gamma$ [7].⁶

5 This surprising result showed that finding a constrained path is harder
than finding a tour, from an approximability perspective [44]

An advantage of SN is that $|\Gamma|$ exactly equals
$(|R| - w)(w - 1) + w(w - 1)/2$ and is a deterministic function of w. It is
independent of the blocking key b and of the actual distribution
of BKVs that b generates. Referring to Table I again, consider
the merge step for w = 3. There would be 7 - 3 + 1 = 5 sliding
steps. In the first four steps, two unique pairs are generated
and added to $\Gamma$. In the final step, three pairs are generated. $\Gamma$
contains (7 - 3) * 2 + 3 * (3 - 1)/2 = 11 pairs.
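
The count can be checked with a direct transcription of the merge step. The sketch below is illustrative (the names `merge_step` and `expected_size` are ours) and assumes the records in the list are distinct.

```python
from itertools import combinations


def merge_step(ordered, w):
    """Candidate set for a sorted list of distinct records and window size w."""
    n = len(ordered)
    if n < w:
        # Fewer records than the window size: every pair shares the window.
        return {frozenset(p) for p in combinations(ordered, 2)}
    pairs = set()
    for start in range(n - w):  # every sliding step except the last
        first = ordered[start]
        for other in ordered[start + 1:start + w]:
            pairs.add(frozenset((first, other)))
    # Final sliding step: pair every record in the window with every other.
    pairs.update(frozenset(p) for p in combinations(ordered[n - w:], 2))
    return pairs


def expected_size(n, w):
    """Closed form (n - w)(w - 1) + w(w - 1)/2, valid for n >= w."""
    return (n - w) * (w - 1) + w * (w - 1) // 2


gamma = merge_step([1, 2, 3, 4, 5, 6, 7], w=3)
```

For the seven records of Table I and w = 3, the sketch produces 11 pairs, in agreement with the closed form.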

Let $\Gamma_m$ be the subset of true positives included in $\Gamma$. The
Pairs Completeness (PC) of $\Gamma$ is defined as the ratio of $|\Gamma_m|$
to the total number of true positive pairs in R [49]. The
metric is commonly used to evaluate blocking procedures and
is an indication of the coverage or recall of the candidate set
[7]. As described informally in Section I, the sorting of $R_l$
can make a difference to PC, if it results in true positives
getting left out of $\Gamma$. Since Y already has a total order, the
problem occurs if records share the same BKV. Suppose a set
of q > 1 records $R_y = \{r_1, \ldots, r_q\}$ have the same BKV y.
Notationally, the q records are said to fall within the same
block $R_y$, identified by the BKV y [7].
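
The dependence of PC on the ordering can be illustrated as follows, taking PC in its usual sense as the fraction of true matching pairs that survive blocking. The ground-truth set `TRUE_MATCHES` below is a hypothetical assumption made for the example, not data from the paper, and the helper names are ours.

```python
def adjacent_pairs(order):
    """Candidate set for window size w = 2."""
    return {frozenset(pair) for pair in zip(order, order[1:])}


def pairs_completeness(candidate_set, true_matches):
    """Fraction of the true matching pairs that appear in the candidate set."""
    if not true_matches:
        raise ValueError("pairs completeness is undefined without true matches")
    return len(candidate_set & true_matches) / len(true_matches)


# Hypothetical ground truth for the records of Table I (assumed for illustration).
TRUE_MATCHES = {frozenset((2, 3)), frozenset((4, 5)), frozenset((6, 7))}

pc_table_order = pairs_completeness(adjacent_pairs([1, 2, 3, 4, 5, 6, 7]), TRUE_MATCHES)
pc_swapped = pairs_completeness(adjacent_pairs([1, 2, 3, 4, 6, 5, 7]), TRUE_MATCHES)
```

Under this assumed ground truth, swapping two records inside the JR7 block lowers PC from 1 to 1/3 even though the BKV order is still respected.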

To break ties, an additional input is required, similar to the
blocking key b, but operating at a finer level of granularity.
This motivates us to define a scoring heuristic f:

Definition 2. Given a set R of records, define a scoring
heuristic on unequal inputs to be a symmetric function
$f : R \times R \to \mathbb{R}^+ \cup \{0\}$, with run-time per invocation bounded
above by $O(|R|^c)$ for some constant c. $\forall r \in R$, f(r, r) is
undefined.

Given a set of pairs (for example, the candidate set $\Gamma$), the
score of that set can be computed by calculating and summing
the score of every pair in the set. Given a list of records
$R_l$, the score of the list will depend on the window size w.
Specifically, the merge step will first have to be run on the
list, and the score calculated for the generated set of pairs.
Algorithm 1 summarizes the process. We designate the score
of the list returned by Algorithm 1 as the w-score, since the
score depends on w.

A maximum-score w-ordering for a set R of records can 
now be defined: 

Definition 3. Given a set R of records, a constant window
parameter w, and a scoring heuristic f, define the maximum-score
w-ordering for R to be an ordering of R given by the
list $R_l$, such that a strictly higher w-score exists for no other
ordering.

Intuitively, while (without imposing additional constraints)
it is incorrect to think of f as a probability density function,
a high score on an input record pair indicates a high degree
of belief⁷ that the pair should be included in $\Gamma$. Although
we have not defined f to be dependent on b in any way, a
practical design would probably consider both functions in
tandem. We further note that even though f is restricted to
run in polynomial time, it would, in practice, be expected
to be inexpensive, similar to the blocking key b. Many such
heuristics have been documented in the literature [8], two
good examples being token-Jaccard and cosine similarity.

6 In a simplified implementation, the last window would not be treated 
differently from the other windows |Tj. This will not change the analysis 
since it only removes a constant additive term 

7 On the part of the domain expert who provided f and b


Both functions are commonly used in the record linkage
community, and are known to work well in a variety of
blocking scenarios [8].
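
As an example of an inexpensive scoring heuristic, the sketch below implements token-Jaccard similarity over whitespace-separated tokens. The tokenization and lower-casing choices are our assumptions; the result is symmetric and non-negative, as Definition 2 requires.

```python
def token_jaccard(r: str, s: str) -> float:
    """Jaccard similarity of the token sets of two records given as strings."""
    a, b = set(r.lower().split()), set(s.lower().split())
    union = a | b
    # Two empty records share no evidence of similarity; score them 0.
    return len(a & b) / len(union) if union else 0.0


score = token_jaccard("John Ridley 77093", "John Ridley Sr. 77093")
```

Records 6 and 7 of Table I share three of their four distinct tokens, so the score is 3/4.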

Recall that Sorted Neighborhood first assigns each record
a BKV and generates a set of blocks, where each block is
a set $R_y$ of records sharing the same BKV y. Let us assume
that a total of u BKVs were generated and that the set Y of
BKVs is $\{y_1, \ldots, y_u\}$. We also noted that |Y| (and therefore
u) can be at most |R| because of the functional definition⁸ of
the blocking key in Definition 1. Let the total order on Y be
$y_1 < \ldots < y_u$ and the sorting order be ascending. After the
BKV generation and sorting phase, we are left with an ordered
list of blocks $\langle R_{y_1}, \ldots, R_{y_u} \rangle$.

Before running the merge (or sliding window) step, each
block should ideally be ordered so that the w-score of the
resulting ordered list of records $R_l$ is maximized for given w
and f. This is a constrained ordering problem; the maximum-score
w-ordering of the full set R of records might yield a
list that potentially disobeys the total order imposed on Y. In
other words, the list could yield a candidate set that is not a
valid Sorted Neighborhood output, given the inputs. Given this
observation, we define a maximum-score Sorted Neighborhood
(max SN) as follows:

Definition 4. Given a scoring heuristic f, windowing constant
w and blocking key b, define maximum-score Sorted Neighborhood
(max SN) as a Sorted Neighborhood algorithm that
generates (from all valid candidate sets) a candidate set $\Gamma$ with
maximum score.

Since max SN is an SN algorithm, it must obey the 
ordering constraint just described. Given the various inputs, the 
candidate set output by max SN is the best result achievable 
for the SN blocking method. Note that changing any of these 
inputs (while keeping R intact) can lead max SN to output a 
different candidate set. 

Furthermore, depending on how the scoring heuristic is
defined, a max SN algorithm can be realized in two different
ways. First, define a local scoring heuristic as a scoring
heuristic that is constrained to returning 0 for every record
pair (r, s) such that r and s have different blocking key
values. On the other hand, a global scoring heuristic has no
such constraint, except for the ones imposed in the original
Definition 2.

For local f, we can claim the following:

Theorem 1. If f is local and the window size is w, maximum-score
w-ordering each block independently in the ordered list
of blocks $\langle R_{y_1}, \ldots, R_{y_u} \rangle$ is both necessary and sufficient
for max SN.

Proof: In Appendix. ■

The proof sketch of this theorem is fairly intuitive; the key
observation is that if a block were not maximum-score w-ordered,
then replacing its ordering with a maximum-score w-ordering would
yield a strictly higher score for $\Gamma$. Since paired records not
from the same block have score 0, such a reordering cannot
influence the scores contributed by neighboring blocks. The
claim also demonstrates why we designated this particular
definition of f as local, since each block can be maximum-score
w-ordered locally in order to achieve a global maximum.
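
The claim can be sanity-checked by brute force on a toy instance with w = 2. Everything in the sketch below is an illustrative assumption: two hypothetical blocks, a hypothetical look-up table `RAW` of within-block scores, and a local heuristic `local_f` that returns 0 across blocks. It is a check on one example, not a proof.

```python
from itertools import permutations, product

# Two hypothetical blocks, already listed in ascending BKV order.
BLOCKS = [["a", "b", "c"], ["d", "e", "f"]]
BLOCK_OF = {r: i for i, block in enumerate(BLOCKS) for r in block}

# Hypothetical within-block scores.
RAW = {
    frozenset("ab"): 3, frozenset("bc"): 1, frozenset("ac"): 4,
    frozenset("de"): 2, frozenset("ef"): 5, frozenset("df"): 1,
}


def local_f(r, s):
    """Local scoring heuristic: 0 whenever r and s lie in different blocks."""
    if BLOCK_OF[r] != BLOCK_OF[s]:
        return 0
    return RAW[frozenset((r, s))]


def two_score(order):
    """2-score of an ordering: sum of scores of adjacent records."""
    return sum(local_f(r, s) for r, s in zip(order, order[1:]))


def order_each_block_independently():
    """Concatenate a maximum-score 2-ordering of every block."""
    ordered = []
    for block in BLOCKS:
        ordered.extend(max(permutations(block), key=two_score))
    return ordered


def best_valid_sn_order():
    """Brute force over all orderings that keep the blocks in BKV order."""
    candidates = (
        [r for block_order in choice for r in block_order]
        for choice in product(*(permutations(b) for b in BLOCKS))
    )
    return max(candidates, key=two_score)
```

On this instance both strategies reach the same 2-score, as the theorem predicts for local f.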

We show that, even for w = 2 and some integer k',
merely determining the existence of an ordering for a set R
of records, such that the ordering has w-score at least k', is
an NP-complete decision problem. We call this problem the
maximum-score w-ordering problem, since an oracle to the
problem can be used to find such an ordering, similar to related
oracles for other NP-complete decision problems.

8 Many-many blocking keys do exist beyond the scope of Sorted Neighborhood
and are used in some modern blocking methods [7]; we do not consider
them in this paper

Algorithm 1 Compute w-score for record list $R_l$

Input:
• A list $R_l$ of records
• A windowing constant w
• A scoring heuristic f

Output:
• A real-valued w-ordering score w-score

Method:
1) Initialize empty set of pairs $\Gamma$
2) Initialize w-score to 0
3) if $|R_l| < w$ then
      $\Gamma = \{\{r, s\} \mid r \neq s\}$, where r, s are records in $R_l$
      Goto line 8
4) end if
5) for all $i \in \{1, \ldots, |R_l| - w + 1\}$ do
      for all $j \in \{i + 1, \ldots, i + w - 1\}$ do
         $\Gamma = \Gamma \cup \{\{R_l[i], R_l[j]\}\}$
      end for
6) end for
7) $\Gamma = \Gamma \cup \{\{R_l[i], R_l[j]\} \mid i \neq j;\ i, j \in \{|R_l| - w + 2, \ldots, |R_l|\}\}$
8) for all $\{r, s\} \in \Gamma$ do
      w-score = w-score + f(r, s)
9) end for
10) Output w-score
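
Algorithm 1 translates almost line for line into Python. The sketch below is our own transcription under stated assumptions: pairs are tracked by list position (so records need not be hashable), f is a hypothetical symmetric look-up table over four abstract records, and the exhaustive search in `best_ordering` is included only to make the optimization problem tangible on tiny inputs.

```python
from itertools import combinations, permutations


def w_score(record_list, w, f):
    """w-score of an ordered record list, following the steps of Algorithm 1."""
    n = len(record_list)
    if n < w:
        index_pairs = set(combinations(range(n), 2))  # step 3: all pairs
    else:
        index_pairs = set()
        for i in range(n - w + 1):                    # step 5: each window
            for j in range(i + 1, i + w):
                index_pairs.add((i, j))
        # Step 7: remaining pairs among the records after the last window start.
        index_pairs.update(combinations(range(n - w + 1, n), 2))
    return sum(f(record_list[i], record_list[j]) for i, j in index_pairs)


def best_ordering(records, w, f):
    """Exhaustive maximum-score w-ordering (factorial time, tiny inputs only)."""
    return max(permutations(records), key=lambda order: w_score(order, w, f))


# Hypothetical symmetric scores on four abstract records.
SCORES = {
    frozenset("ab"): 8, frozenset("cd"): 9, frozenset("ac"): 5,
    frozenset("ad"): 4, frozenset("bc"): 2, frozenset("bd"): 1,
}


def f(r, s):
    return SCORES[frozenset((r, s))]


best = best_ordering("abcd", 2, f)
```

On this toy input the ordering a, b, c, d has 2-score 19, whereas the best ordering reaches 22, which shows that a given ordering need not be maximum-score.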

Theorem 2. Maximum-score 2-ordering of a set R of records 
is NP-complete. 


Proof: We show a polynomial-time reduction from min
path-TSP, introduced in Section III-B. The version used in this
proof is that of both endpoints being unknown. Recall that
the decision version of the problem is to determine
if a Hamiltonian path with cost at most k exists in a given
complete, undirected, weighted graph G = (V, E, W). The
problem is known to be NP-complete even if weights are
non-negative integers, as we assume.⁹

We begin the reduction by bijectively mapping each vertex
$v \in V$ to a record r and placing all mapped records in a
set R. Suppose |V| = m > 2. Then the set R contains the
records $\{r_1, \ldots, r_m\}$. Define the non-negative integer $W_E$ to
be $\sum_{e \in E} W(e)$, where W(e) is the weight of edge e. $W_E$
is a non-negative integer because all weights were assumed
to be non-negative integers. We construct f as a symmetric
look-up table as follows: the score between any two distinct
records $r_i$ and $r_j$ ($i \neq j$; $i, j \in \{1, \ldots, m\}$) is simply
$W_E - W(\{v_i, v_j\})$, where $v_i$ and $v_j$ are the corresponding vertices.
Constructed this way, f is both symmetric and non-negative
and is therefore an eligible scoring heuristic. f also runs in
polynomial time, since each look-up requires (at worst) a pass
over a table occupying $O(|R|^2)$ space.

The entire construction takes quadratic time. We query the
maximum-score 2-ordering oracle for the existence of a list
with 2-score at least k', with $k' = W_E(m - 1) - k + 1$ (recall
that m = |V| = |R|). k' is an integer, since $W_E$ is an integer.
We claim that the (boolean) output of the oracle is also the
output of the original min path-TSP problem instance; hence,
we claim a correct Karp reduction.

First, we prove correctness for True oracle outputs. A True
output implies that a list with 2-score at least k' exists; let the
list be $\langle r_1, \ldots, r_m \rangle$ without loss of generality. The 2-score
of this list (per Algorithm 1 semantics) is $\sum_{i=1}^{m-1} f(r_i, r_{i+1})$.
By construction, $f(r_i, r_{i+1}) = W_E - W(\{v_i, v_{i+1}\})$ and therefore
$\sum_{i=1}^{m-1} f(r_i, r_{i+1}) = \sum_{i=1}^{m-1} (W_E - W(\{v_i, v_{i+1}\}))$.
Because $W_E$ is independent of i and the summation is
over m - 1 elements, we can rewrite the right hand side
as $W_E(m - 1) - \sum_{i=1}^{m-1} W(\{v_i, v_{i+1}\})$. Since the oracle
returned True, this quantity is at least k'; in other
words, $W_E(m - 1) - \sum_{i=1}^{m-1} W(\{v_i, v_{i+1}\}) \geq k' = W_E(m - 1) - k + 1$
by definition of k' above. This in
turn implies that $-\sum_{i=1}^{m-1} W(\{v_i, v_{i+1}\}) \geq -k + 1$, or
$k \geq \sum_{i=1}^{m-1} W(\{v_i, v_{i+1}\}) + 1$. Since k is an integer, this shows
that $\sum_{i=1}^{m-1} W(\{v_i, v_{i+1}\}) < k$. But this implies that a
Hamiltonian path¹⁰ $\langle v_1, \{v_1, v_2\}, v_2, \ldots, \{v_{m-1}, v_m\}, v_m \rangle$
with weight at most k exists.

If the oracle returns False, we can use the same sequence of
equations and a proof by contradiction to show that it cannot
be the case that a Hamiltonian path with weight at most k
exists in the input graph, since if it did, a corresponding list
can be constructed with 2-score at least k'. Together, these
arguments show that the proposed Karp reduction is valid.

Finally, to show that the maximum-score 2-ordering problem
is in NP, we accept a list $R_l$ as a certificate, input
$R_l$ (with w = 2) to Algorithm 1 and compare the 2-score
output by the algorithm to the input decision constant k'.
For constant w and polynomial-time f, Algorithm 1 runs in
(polynomial) time $O(t(f)|R_l|)$. The verification algorithm is
therefore polynomial and maximum-score 2-ordering is in NP.
In combination with the first part of the proof, we conclude
that maximum-score 2-ordering is NP-complete.
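
The core identity behind the reduction, namely that the 2-score of a list equals $W_E(m - 1)$ minus the weight of the corresponding Hamiltonian path, can be verified on a small instance. The sketch below is illustrative: the graph is a hypothetical example, records are identified with vertices (the identity bijection), and the function names are ours.

```python
from itertools import permutations


def build_scoring_table(weight):
    """Return W_E and the look-up table f({r_i, r_j}) = W_E - W({v_i, v_j})."""
    total = sum(weight.values())  # W_E, the sum of all edge weights
    return total, {edge: total - w for edge, w in weight.items()}


def two_score(order, table):
    """2-score of an ordering under the look-up table."""
    return sum(table[frozenset((r, s))] for r, s in zip(order, order[1:]))


def path_weight(order, weight):
    """Weight of the Hamiltonian path that visits vertices in `order`."""
    return sum(weight[frozenset((u, v))] for u, v in zip(order, order[1:]))


# Hypothetical complete graph with non-negative integer weights.
VERTICES = ["a", "b", "c", "d"]
WEIGHT = {
    frozenset(("a", "b")): 1, frozenset(("b", "c")): 2, frozenset(("a", "c")): 5,
    frozenset(("a", "d")): 4, frozenset(("b", "d")): 3, frozenset(("c", "d")): 1,
}

W_E, TABLE = build_scoring_table(WEIGHT)
m = len(VERTICES)
max_score = max(two_score(p, TABLE) for p in permutations(VERTICES))
min_weight = min(path_weight(p, WEIGHT) for p in permutations(VERTICES))
```

Because the identity holds for every ordering, an ordering maximizes the 2-score exactly when the corresponding path minimizes the weight.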


Corollary 1. Maximum-score 2-ordering of a set R of records
is NP-complete for a scoring heuristic f of the form $f = 1 - f'$,
where f' is a metric with range [0, 1].

Proof: We reduce from min metric path-TSP (again with
both endpoints unknown), which is also known to be NP-complete
[15]. We construct f to have the form $1 - f'$ (instead
of subtracting from $W_E$, the sum of all edge weights), where
f' is the metric weight function of the input graph G. Also, k'
in Theorem 2 is set to $(m - 1) - k + 1$. Otherwise, the proof
remains virtually the same as that of Theorem 2. ■

The form above is important because many of the similarity
functions in the literature adhere to it [8]. Examples include
Jaccard and cosine similarities, whose distance versions are
known to be metric [8]. The corollary shows that, even for
this special case, the problem is no easier.
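As a concrete instance of the form f = 1 - f', the following sketch uses Jaccard distance as f'. The token-set records are hypothetical, and the triangle-inequality check is only a spot check on this sample, not a proof of metricity.

```python
from itertools import permutations


def jaccard_distance(a, b):
    """f': Jaccard distance between two token sets, with range [0, 1]."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def f(a, b):
    """Scoring heuristic of the form f = 1 - f'."""
    return 1.0 - jaccard_distance(a, b)


# Hypothetical records, tokenized into sets.
records = [
    frozenset({"john", "smith"}),
    frozenset({"jon", "smith"}),
    frozenset({"john", "smyth", "jr"}),
    frozenset({"mary", "jones"}),
]

# Spot-check the triangle inequality of f' on every ordered triple.
triangle_ok = all(
    jaccard_distance(x, z)
    <= jaccard_distance(x, y) + jaccard_distance(y, z) + 1e-12
    for x, y, z in permutations(records, 3)
)
```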

This is also the case as w increases, assuming w is still
a constant. Note that, while Theorem 2 can be used to
prove that maximum-score w-ordering is NP-complete for an
arbitrary constant w, it does not prove NP-completeness for
any constant w > 2. To understand the difference, consider
the classic NP-complete problem of proving satisfiability of
a k-CNF formula, for an arbitrary k ≥ 2. This problem is


9 It stays NP-complete even if weights are only 0-1, by reducing from the
Hamiltonian path problem [36]


10 Graph theoretically, a path is defined as an alternating sequence of vertices
and edges [36]







[Figure 1: two blocks, each maximum-score 2-ordered, annotated with the conditions
(1) $f(4,5) < f(4,6)$ and (2) $f(2,3) + f(4,5) < f(3,5) + f(2,4)$]

Fig. 1. A construction showing why maximum-score w-ordering each block
separately is neither sufficient (given (1)) nor necessary (given (2)), assuming
w = 2, an arbitrary f and that each block in the figure is maximum-score
2-ordered


NP-complete, since 3-CNF satisfiability is known to be NP-complete.
However, this would not be true for any constant
k ≥ 2, since 2-CNF satisfiability is known to be in P [36].

Reducing the ordering problem from typical variants of min
TSP is not straightforward for w > 2, since we relied on the
fact that w = 2 in the problem reduction. Instead, we reduce
the maximum-score 2-ordering problem, which Theorem 2
showed to be NP-complete, to the maximum-score w-ordering
problem for a constant w > 2 by using a polynomial-time
scaling mechanism. The proof is quite technical; we reproduce
it in the Appendix for the interested reader.


Theorem 3. Maximum-score w-ordering of a set R of records 
is NP-complete, for any constant w > 2. 


Proof: In Appendix. ■ 

Corollary 2. Maximum-score w-ordering of a set R of 
records is NP-complete, for any constant w > 2. 

Mirroring the case of Corollary 1, the ordering problem for
w > 2 continues to be NP-complete even for metric f. The
technical details are not repeated here.

Devising a max SN algorithm even for local scoring heuristics
becomes an NP-hard problem because of Theorem 3, since
each independent block needs to be maximum-score w-ordered
(Theorem 1). The problem is no easier if the heuristic is
global, since if a polynomial-time oracle exists for an arbitrary
heuristic f, it can be used to solve the special case of local f.

There are other consequences of having a global scoring
heuristic. First, Theorem 1 is no longer true. A simple construction
in Figure 1 illustrates why. Consider just two blocks
that have been individually maximum-score 2-ordered. By
itself, the first condition ($f(4,5) < f(4,6)$) in the construction
shows that this is no longer sufficient, since reversing records
5 and 6 will still be a maximum-score 2-ordering for Block
2, but the score of the overall candidate set will be higher. By
itself, the second condition shows that the maximum-score
2-ordering is also not necessary. Intuitively, if the scores are
high for some pair of records straddling blocks, then this could
theoretically be enough to compensate for any gains that could
be achieved from local maximum-score 2-ordering that does
not place those records at the block boundaries, as in the
example.
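Both effects can be reproduced with a small brute-force check. The scores below are hypothetical values chosen to satisfy conditions (1) and (2); they are not the values of the original figure, and the block contents (records 1-4 and 5-6) are likewise an assumption made for illustration.

```python
from itertools import permutations

# Hypothetical global scoring heuristic on six records; unlisted pairs score 0.
SCORES = {
    (1, 2): 5, (2, 3): 5, (3, 4): 5, (2, 4): 4,   # within Block 1
    (5, 6): 2,                                    # within Block 2
    (4, 5): 1, (4, 6): 3, (3, 5): 9,              # pairs straddling blocks
}


def f(a, b):
    return SCORES.get((min(a, b), max(a, b)), 0)


def two_score(order):
    return sum(f(a, b) for a, b in zip(order, order[1:]))


block1, block2 = (1, 2, 3, 4), (5, 6)
local_best1 = max(two_score(p) for p in permutations(block1))

# Both lists keep each block maximum-score 2-ordered, yet they score differently.
score_a = two_score((1, 2, 3, 4, 5, 6))
score_b = two_score((1, 2, 3, 4, 6, 5))

# Best overall list that still stacks Block 1 before Block 2.
best_total, best_list = max(
    (two_score(p1 + p2), p1 + p2)
    for p1 in permutations(block1)
    for p2 in permutations(block2)
)
```

In this instance the best overall list orders Block 1 as (1, 2, 4, 3), which is not a maximum-score 2-ordering of that block on its own.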

Thus far, we assumed a single blocking key b and single-pass
Sorted Neighborhood, but the analysis can be extended
in a straightforward way to multi-pass Sorted Neighborhood.
Specifically, assume a set $B = \{b_1, \ldots, b_c\}$ of c blocking keys,
where c ≥ 1 is some constant. For each key $b_i \in B$, the single-pass
SN procedure is run and a candidate set $\Gamma_i$ is output. In
this way, c independent passes are run, and c candidate sets are
output. The final candidate set output by the entire procedure
is simply the union of all c sets [5], [9].
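The multi-pass procedure can be sketched in a few lines. This is a simplified illustration that ignores the ordering problem inside blocks; the function names, the tuple-shaped records, and the two blocking keys are assumptions made for the example.

```python
def merge(sorted_records, w=2):
    """Slide a window of size w over the list and pair records inside it."""
    pairs = set()
    n = len(sorted_records)
    for i in range(n):
        for j in range(i + 1, min(i + w, n)):
            pairs.add(frozenset((sorted_records[i], sorted_records[j])))
    return pairs


def single_pass_sn(records, blocking_key, w=2):
    """One pass: sort by the blocking key value, then run the merge step."""
    return merge(sorted(records, key=blocking_key), w)


def multi_pass_sn(records, blocking_keys, w=2):
    """Union of the candidate sets produced by c independent passes."""
    candidates = set()
    for b in blocking_keys:
        candidates |= single_pass_sn(records, b, w)
    return candidates


# Hypothetical (first name, last name) records and two blocking keys.
records = [("john", "smith"), ("jon", "smith"),
           ("john", "smyth"), ("mary", "jones")]
by_last = lambda r: r[1]
by_first = lambda r: r[0]

one_pass = single_pass_sn(records, by_last)
two_pass = multi_pass_sn(records, [by_last, by_first])
```

With w = 2, each pass contributes exactly |R| - 1 pairs, and the second key recovers the pair (john smith, john smyth) that the last-name pass misses.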

While the asymptotic analysis does not change, multi-pass
SN runs slower (in practice) by a factor of c on a serial
architecture. Because passes are independent, both multi-core
and shared-nothing parallel architectures are appropriate for
the problem [9], [14]. We can define max multi-pass SN as
a multi-pass SN algorithm with each individual pass meeting
the requirement set in Definition 4. Note that the windowing
constant w is assumed to be fixed over all passes, but each
pass can have its own scoring heuristic. Formally, each pass
now takes a pair <b, f> as input, with a total of c distinct
pairs for a c-pass procedure. This further implies that there
need not be c distinct blocking keys and scoring heuristics,
merely c distinct pairs.

Multi-pass SN has widely emerged as the method of choice
(over single-pass SN) both because of parallelism and
because it was experimentally verified to increase the recall
of Γ, mainly due to using a diverse set of blocking keys [9].
In combination with a small window constant w, inclusion of
false positives in the candidate set was also found to be greatly
reduced [5].

B. Traditional blocking 

Although the primary focus of this work is Sorted Neighborhood,
we briefly show how the ordering problem arises in
the hash-based traditional blocking method, which continues
to enjoy popularity due to its simple implementation [7]. In
the original version, a functional blocking key is assumed and
each record is assigned a single BKV, exactly like in SN.
However, a total order is not assumed on the set of BKVs Y,
which implies that the records cannot be sorted. Instead, each
block $R_y$ is treated like a hash bucket, and the hash key is
simply the BKV y. Records sharing a block are paired and
added to Γ.

This last step leads to problems when we consider the issue
of data skew [7]. If some block contains far too many records
compared to other blocks, pairing records in that block will
dominate run-time. Sorted Neighborhood systematically dealt
with the issue by using a constant window size in the merge
step. To address the same issue in traditional blocking, several
ad-hoc techniques have been proposed [50].

In our recent work, we used the same sliding window proce¬ 
dure as SN in each block and generated the resulting candidate 
set ED- We showed competitive empirical performance of the 
technique (on standard benchmarks) even with a simple token- 
based blocking key. To distinguish this procedure from the SN 
merge procedure, we designate it as the block-merge step, since 
the merge step is run on each block in isolation. 
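A minimal sketch of the block-merge step follows, under the assumption that records within a block are kept in arrival order (that is, without solving the ordering problem). The function name and the first-letter blocking key are hypothetical.

```python
from collections import defaultdict


def block_merge(records, blocking_key, w=2):
    """Hash records into blocks by BKV, then slide a window inside each block."""
    blocks = defaultdict(list)
    for r in records:
        blocks[blocking_key(r)].append(r)

    candidates = set()
    for block in blocks.values():
        for i in range(len(block)):
            for j in range(i + 1, min(i + w, len(block))):
                candidates.add(frozenset((block[i], block[j])))
    return candidates


# Hypothetical records, blocked on their first letter.
records = ["smith", "smyth", "smithe", "jones", "johns"]
first_letter = lambda r: r[0]

windowed = block_merge(records, first_letter, w=2)
exhaustive = block_merge(records, first_letter, w=3)  # window covers each block
```

With w = 2 a block of size s yields s - 1 pairs instead of s(s - 1)/2, which is how the window bounds the effect of data skew; no pair ever straddles two blocks.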

It is straightforward to adapt the formalism presented thus
far, if we assume the block-merge procedure for controlling
data skew in traditional blocking. Maximum-score traditional
blocking (or max traditional blocking) can be defined in the
same vein as in Definition 4. Note that since a total ordering
on Y (and hence, stacking of blocks against each other) does
not exist in traditional blocking, the difference between global
and local f does not arise.

Finally, we can state a version of Theorem 1 for max
traditional blocking.

Theorem 4. For any constant w, scoring heuristic f, and a set
Π of blocks generated by a functional blocking key, maximum-score
w-ordering each block in Π individually is both necessary
and sufficient for max traditional blocking, assuming the
block-merge procedure for generating the candidate set.











[Figure 2: Block 1 and Block 2 generated by a non-functional blocking key,
with maximum-score and non-maximum-score 2-orderings.
Score matrix: f(1,2) = 2, f(2,3) = 7, f(1,3) = 3, f(3,4) = 4, f(1,4) = 1.
Candidate set:
(1) With Block 2 maximum-score 2-ordered: {{1,3},{3,2},{3,4}}, w-score = 14
(2) With Block 2 not maximum-score 2-ordered: {{1,3},{3,2},{1,4},{4,3}}, w-score = 15]

Fig. 2. A construction showing that Theorem 4 does not hold for many-many
traditional blocking. The maximum-score 2-ordering of each block and the
2-scores may be verified by a brute-force calculation using the provided f


The proof is quite similar to that of Theorem 1; we do not
repeat it.

Interestingly, even though the difference between global and
local f does not arise for traditional blocking, a related issue
arises if we consider traditional blocking with non-functional
blocking keys. Such keys can assign multiple blocking keys to
a record. Just like Theorem 1 was shown not to hold for global
f, a simple construction (Figure 2) shows that Theorem 4 does
not necessarily hold for many-many traditional blocking. We
leave for future work to determine the appropriate conditions
for guaranteeing an optimal candidate set both for many-many
traditional blocking, as well as Sorted Neighborhood with a
global scoring heuristic.


V. Approximate Solutions 

Theorem 2 showed that maximum-score 2-ordering is NP-complete.
Thus, the next best course of action is to devise
polynomial-time approximation algorithms, preferably with
good constant bounds. Given the close connection between
the 2-ordering problem and TSP, and the progress in max
tour-TSP approximations (Section III-B), a natural question to
ask is whether maximum-score 2-ordering can be reduced to
the appropriate max tour-TSP version. We show subsequently
that this is feasible, and we utilize this reduction in the
approximation algorithms proposed in this section.

In the rest of this section, we explicitly assume w = 2. We 
present three approximation algorithms, with two algorithms 
addressing multi-pass Sorted Neighborhood for local and 
global scoring heuristics respectively, and one MapReduce 
algorithm for traditional blocking with the block-merge pro¬ 
cedure and with w = 2. There are a number of reasons why 
we focus exclusively on the w = 2 case. 

First, the complexity of multi-pass Sorted Neighborhood,
even while neglecting the ordering problem, is known to be
$O(c(|R| \log |R| + w|R|))$, where c is the number of passes, w
is the windowing constant and R, the input set of records [5].
Even though c and w are constants, they cannot be neglected
in practice, as the experiments in the original paper showed
[5]. It was found in the experiments that both run-time and the
false-positives included in the candidate set increased rapidly
with w. The conclusion was that, even for a test database with
slightly under fourteen thousand records, the recall of Γ with
c = 3 achieved a high value at w = 2 and remained virtually

Algorithm 2 Multi-pass Sorted Neighborhood, local f

Input :

• Set R of records

• Set C containing c blocking key and local scoring
heuristic pairs

Output :

• Candidate set of pairs Γ

Method :

1) Initialize empty candidate set Γ

2) for all pairs <b, f> ∈ C do

Γ := Γ ∪ Algorithm 3(R, f, b)

3) end for


flat thereafter. The recall was higher than with c = 1 and with
w set to high values. We cited this result as a motivation in
Section I and it is the main reason we focus on maximizing
the performance of multi-pass SN for w = 2.

The second problem with assuming an arbitrary w is that 
TSP is no longer applicable. A reduction either to or from TSP 
is not evident for w > 2. The approximability status of the 
problem is also unknown, in the absence of a clear reduction. 

For these reasons, we leave devising approximations for 
w > 2 for future work. In the rest of this section, assume that 
a polynomial-time ^-approximation algorithm for max tour- 
TSP is available as a subroutine, MAX TOUR-TSP. We have 
already cited one such algorithm in the literature 03 but in 
general, any appropriate max tour-TSP approximation algo¬ 
rithm may be used. We note that if a randomized subroutine 
is used, then proposed algorithms also become randomized 
and approximation ratios are expected , rather than guaranteed. 

Finally, the analysis will depend on the distribution of BKVs 
generated by the blocking keys input to the algorithms. We 
will conduct the analysis for both the uniform distribution as 
well as the Zipf distribution ED- The uniform distribution 
is the ideal case, since it assumes no data skew and all 
blocks have the same number of records. The Zipf distribution 
involves a realistic amount of data skew. For this reason, both 
distributions were taken into account in a recent survey of 
blocking methods [7].

To describe the Zipf distribution, let $H_u$ denote the
partial harmonic sum for some positive integer u:

$$H_u = \sum_{i=1}^{u} \frac{1}{i} \qquad (1)$$

Given a set R of records and a blocking key that assigns BKVs
to records according to the Zipf distribution, let $|Y| = u$, that
is, u blocks are generated. In descending order by size, the
m-th block will have size $|R|/(mH_u)$.
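The block sizes implied by this definition can be computed directly. The sketch below uses exact rational arithmetic so that the sizes sum to |R| without rounding error; the function names and the example values |R| = 1100 and u = 3 are illustrative assumptions, chosen so that the sizes happen to be integers.

```python
from fractions import Fraction


def harmonic(u):
    """Partial harmonic sum H_u = 1 + 1/2 + ... + 1/u."""
    return sum(Fraction(1, i) for i in range(1, u + 1))


def zipf_block_sizes(n_records, u):
    """Size of the m-th largest block, |R| / (m * H_u), for m = 1..u."""
    h_u = harmonic(u)
    return [Fraction(n_records) / (m * h_u) for m in range(1, u + 1)]


sizes = zipf_block_sizes(1100, 3)   # H_3 = 11/6
```

In general the sizes are not integral, which is why the analysis neglects rounding.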

Attribute values in many practical databases have been
known to occur with Zipf-like frequency, including US and
Chinese firm sizes [52], [53], and more importantly, personal
names [54]. The analysis assuming the Zipf distribution for
blocking key values is therefore expected to match real-world
scenarios more closely than the uniform distribution.


A. Multi-pass Sorted Neighborhood with local scoring heuris¬ 
tics 

Algorithm 2 presents the pseudocode for multi-pass Sorted
Neighborhood that takes as input a set R of records and a set
C of c pairs, with each pair comprising a blocking key and
a local scoring heuristic. From the discussion at the end of






















Fig. 3. The two conversion subroutines used in Algorithm 3: (a) illustrates RecordsToGraph and (b) illustrates TourToList


Algorithm 3 Single-pass Sorted Neighborhood, local f

Input :

• Set R of records

• Local scoring heuristic f

• Blocking key b

Output :

• Candidate set of pairs Γ

Method :

1) Initialize empty multimap M containing pairs of key-value-sets,
with the key being a BKV and the value-set,
a set of records

2) Initialize empty set of BKVs Y, and empty list Y^l

3) Initialize empty list of records R^l

4) Initialize empty candidate set Γ

5) for all records r ∈ R do

Apply b on r to get BKV b(r)
if b(r) ∉ keyset(M) then
Add pair <b(r), {}> to M
end if

Add r to M[b(r)], the value-set associated with b(r)

Add b(r) to Y

6) end for

7) Sort Y using total order to get list Y^l

8) for (in-order) y ∈ Y^l do

Let G := RecordsToGraph(M[y], f), where M[y]
is the value-set associated with key y

Call MAX TOUR-TSP on G

Call TourToList on TSP output to get list of records R^l_y

Append R^l_y to R^l

9) end for

10) Run merge procedure on R^l with w = 2, populate Γ

11) Output Γ


Section IV, this does not imply c unique blocking keys and c
unique scoring heuristics. For each of the c pairs, Algorithm 2
invokes Algorithm 3 and forms the union of the candidate set
output by Algorithm 3 and the current candidate set maintained
by Algorithm 2. Each iteration of the loop in line 2 is therefore
a pass in the Sorted Neighborhood sense.


Algorithm 3 presents the pseudocode for approximating
a solution to single-pass SN that accounts for 2-ordering,
assuming an arbitrary local scoring heuristic. The algorithm
begins (lines 1-4) by initializing some data structures, including
a multimap with BKVs for keys and with each key
pointing to its associated block.$^{11}$ Lines 5-6 perform the BKV
computation and block generation step, while line 7 uses
the total order to get a sorted list $Y^l$ of BKVs. The list
is traversed in order, and for each BKV y in the list, the
block $R_y$ is converted into an undirected, complete, weighted
graph using the auxiliary subroutine RecordsToGraph. Figure
3(a) illustrates the functionality of the subroutine; we do
not provide the technical pseudocode here. Specifically, each
record is bijectively mapped to a vertex. In addition, a dummy
vertex is also created. The weight of any edge between two
distinct non-dummy vertices $v_1$ and $v_2$ is simply $f(r_1, r_2)$,
assuming records $r_1$ and $r_2$ were mapped to vertices $v_1$ and
$v_2$ respectively. The weight of any edge between the dummy
vertex and any non-dummy vertex is 0.
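One possible realization of RecordsToGraph is sketched below. The representation (a vertex list plus a dictionary keyed by unordered vertex pairs), the sentinel name DUMMY, and the toy scoring heuristic are assumptions made for illustration; the identity mapping from records to vertices plays the role of the bijection.

```python
from itertools import combinations

DUMMY = "__dummy__"   # hypothetical sentinel for the extra vertex


def records_to_graph(block, f):
    """Build a complete, undirected, weighted graph from one block.

    Each record is its own vertex; every edge between two records is
    weighted by f, and every edge touching the dummy vertex weighs 0.
    """
    vertices = list(block) + [DUMMY]
    weights = {}
    for a, b in combinations(block, 2):
        weights[frozenset((a, b))] = f(a, b)
    for r in block:
        weights[frozenset((r, DUMMY))] = 0
    return vertices, weights


# Toy block of three records with a hypothetical scoring heuristic.
vertices, weights = records_to_graph([1, 2, 3], lambda a, b: a + b)
```

For a block of s records the graph has s + 1 vertices and s(s + 1)/2 edges, so construction costs a quadratic number of calls to f.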

MAX TOUR-TSP is then invoked on the graph, and a
Hamiltonian circuit is output. A subtle point to note is that
MAX TOUR-TSP must work for arbitrary (that is, not necessarily
metric) non-negative weight functions, regardless of
whether the scoring heuristic is metric or non-metric. This
is because adding the dummy vertex necessarily makes the
weight function non-metric, assuming at least one non-zero
weight. Consider an edge $\{v_1, v_2\}$ that has non-zero weight.
In the constructed graph, the three edges connecting vertices
$v_1$, $v_2$ and dummy will not satisfy the triangle inequality since
the dummy edges are 0. Thus, the 0 sum of the dummy edges
will be strictly less than the non-zero weight of the third edge,
which is a violation of the triangle inequality.

This is one motivation for using a max TSP subroutine, 
since approximation algorithms for max tour-TSP exist that 
do not place metric assumptions on the weight function m- 
To leverage better bounds for metric weight functions, a max
path-TSP algorithm is required. To the best of our knowledge,
none of the max tour-TSP algorithms recently proposed in
the literature have been adapted (or can be easily adapted) to
solve the path version [16]. However, it is evident that such
an algorithm can be used in Algorithm 3 (in place of MAX
TOUR-TSP) if a user so desires, by modifying RecordsToGraph
so that the extra dummy vertex is not constructed.
TourToList, illustrated in Figure 3(b), is another straightforward
auxiliary subroutine that is invoked on the Hamiltonian
circuit output by MAX TOUR-TSP. Construct the list by
starting (and ending) the circuit output by TSP from the
dummy vertex, after which the dummy vertex is discarded. The
list of vertices is reverse mapped to the list of corresponding
records. An interesting point is that, in the example in Figure
3(b), both (2,1,3,4) and (4,3,1,2) have equal w-scores. This
is generally true, given that the graph is undirected. If f is
local, the choice of ordering will not matter, and TourToList
can arbitrarily return one of the two. For global f, the choice
will matter, an issue we address in the next section. Line 10
runs the sliding window procedure on the generated list $R^l$
and the populated candidate set is output.

11 Which is the key's value-set; hence, the term multimap
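The flow of Algorithm 3 can be sketched end to end on a toy input. Two caveats apply: the exhaustive search below is an exact, exponential-time stand-in for the MAX TOUR-TSP subroutine (usable only on tiny blocks, and not an approximation algorithm), and the graph is kept implicit through a weight function rather than materialized. The records, the first-letter blocking key, and the shared-prefix heuristic are hypothetical.

```python
from collections import defaultdict
from itertools import permutations

DUMMY = object()  # stands for the extra zero-weight vertex


def max_tour_bruteforce(vertices, weight):
    """Exact, exponential-time stand-in for the MAX TOUR-TSP subroutine."""
    first, rest = vertices[0], vertices[1:]

    def tour_weight(p):
        cycle = (first,) + p + (first,)
        return sum(weight(a, b) for a, b in zip(cycle, cycle[1:]))

    return [first, *max(permutations(rest), key=tour_weight)]


def tour_to_list(tour):
    """Cut the circuit at the dummy vertex and drop it."""
    i = tour.index(DUMMY)
    return tour[i + 1:] + tour[:i]


def single_pass_sn(records, b, f):
    blocks = defaultdict(list)
    for r in records:                       # lines 5-6: build blocks by BKV
        blocks[b(r)].append(r)

    def weight(x, y):                       # RecordsToGraph, kept implicit
        return 0 if x is DUMMY or y is DUMMY else f(x, y)

    ordered = []
    for y in sorted(blocks):                # lines 7-9: visit BKVs in order
        tour = max_tour_bruteforce([DUMMY] + blocks[y], weight)
        ordered.extend(tour_to_list(tour))
    # line 10: with w = 2, merge pairs each record with its successor
    return ordered, {frozenset(p) for p in zip(ordered, ordered[1:])}


def common_prefix(a, b):
    """Hypothetical local scoring heuristic: length of the shared prefix."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


records = ["smith", "jones", "smyth", "smithe", "johns"]
ordered, candidates = single_pass_sn(records, lambda r: r[0], common_prefix)
```

Because the dummy edges weigh 0, the weight of a tour equals the 2-score of the list obtained by cutting it at the dummy vertex, so the best tour yields a maximum-score 2-ordering of the block.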

The run-time of Algorithm 3 depends on several factors,
including the run-time of the blocking key b per record
and the distribution of BKVs generated by b. Let the per-invocation
run-time of b be denoted as t(b). Assume that b
generates u blocking key values. That is, each BKV y is
assumed to refer to a block $R_y$ containing an equal number
($= |R|/u$) of records. Finally, assume the amortized run-time
of f to be t(f) per record pair. Amortization is important
for some commonly encountered scoring heuristics like cosine
similarity that have a start-up phase where token statistics
(such as frequencies of tokens) need to be collected and stored
over the set of records [55]. Once this phase has concluded,
computing the similarity takes time that is near-constant, since
it only involves a look-up and a simple calculation (like a dot
product).

Regardless of BKV distribution, lines 1-7 in Algorithm 3
will run in time $O(t(b)|R| + u \log u)$. For the remaining
analysis, consider first the case of uniform distribution. That
is, each of the u blocks has an equal number of records,
which is $|R|/u$. We neglect issues due to rounding for the sake
of analysis.

The time taken by RecordsToGraph on a single block is
$O((|R|/u)^2 t(f))$. Assume the TSP approximation subroutines
to run in time $O(|V|^q) = O((|R|/u)^q)$ where q is a constant
and is denoted as the TSP constant. Typically, q < 3 p6) . 
The merge step (line 10) takes time $O(|R|)$, given it involves
exactly $|R| - 2 + 1 = |R| - 1$ sliding steps. Thus, the
total run-time of Algorithm 3 for a uniform blocking key
is $O(u((|R|/u)^2 t(f) + (|R|/u)^q) + (t(b) + 1)|R| + u \log u)$
and the generated candidate set has size exactly $|R| - 1$ or
$O(|R|)$, since each sliding step generates exactly one pair.
Since $u = O(|R|)$ due to the functional definition of a
blocking key, the expression above may be simplified further
as $O(u((|R|/u)^2 t(f) + (|R|/u)^q) + (t(b) + 1)|R|)$.

If we conduct the same run-time analysis for Zipf distribution,
the complexity of merge (and the candidate set
size) remains the same, which we noted at the beginning
of Section IV as an advantage of Sorted Neighborhood.
Specifically, the size of Γ is never dependent on the blocking
key. However, the total run-time of Algorithm 3 will
change, since the TSP subroutine takes as input the graph
representation of a single block in each invocation. In accordance
with the Zipf distribution, the run-time of Algorithm 3 becomes
$O(\sum_{i=1}^{u} ((|R|/(iH_u))^2 t(f) + (|R|/(iH_u))^q) + t(b)|R| + |R| + u \log u)$
$= O(\sum_{i=1}^{u} ((|R|/(iH_u))^2 t(f) + (|R|/(iH_u))^q) + (t(b) + 1)|R|)$.
We can rewrite the first two terms as
$t(f)(|R|/H_u)^2 \sum_{i=1}^{u} 1/i^2 + (|R|/H_u)^q \sum_{i=1}^{u} 1/i^q$.
Unfortunately, the summations do not have closed forms, and
we cannot simplify the expression further.
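Although the sums lack closed forms, they are cheap to evaluate numerically, which allows the per-block terms of the two distributions to be compared for concrete parameters. The sketch below evaluates only those terms (graph construction plus TSP), ignoring constants and the shared (t(b) + 1)|R| term; the function names and parameter values are illustrative assumptions.

```python
def harmonic(u, power=1):
    """Generalized partial harmonic sum: sum of 1 / i**power for i = 1..u."""
    return sum(1.0 / i ** power for i in range(1, u + 1))


def uniform_block_cost(n, u, q, t_f=1.0):
    """u * ((|R|/u)^2 t(f) + (|R|/u)^q): per-block terms, uniform BKVs."""
    size = n / u
    return u * (size ** 2 * t_f + size ** q)


def zipf_block_cost(n, u, q, t_f=1.0):
    """t(f) (|R|/H_u)^2 sum 1/i^2 + (|R|/H_u)^q sum 1/i^q: Zipf BKVs."""
    largest = n / harmonic(u)           # size of the largest block
    return t_f * largest ** 2 * harmonic(u, 2) + largest ** q * harmonic(u, q)


uniform = uniform_block_cost(1000, 10, 3)
zipf = zipf_block_cost(1000, 10, 3)
```

For these parameters the Zipf cost is several times the uniform cost, reflecting the dominance of the largest block. Note also that the two sums are partial sums of convergent series whenever the exponent is at least 2, so they stay bounded by constants as u grows.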

For a comparison to the run-time of the same Sorted 
Neighborhood procedure that neglects the ordering problem. 
As for qualitative performance, we can prove the following 


Algorithm 4 MapReduce-based Traditional Blocking

MAP:

Input :

• Set R of records

• Blocking key b

Output :

• Set of key-value pairs of form <b(r), r> where b(r)
is the BKV of record r

Method :

1) for all r ∈ R do

Emit <b(r), r>

2) end for

REDUCE:

Input :

• Key-value set of form <y, R_y> where the set R_y
contains exactly those records with BKV y

• Scoring heuristic f

Output :

• Set of key-value pairs of form <s, q> where s is a
double-valued score of unordered record pair q = {r, s}

Method :

1) G := RecordsToGraph(R_y, f)

2) Call MAX TOUR-TSP on G

3) Call TourToList on TSP output to get list of records R^l_y

4) Run block-merge procedure on R^l_y with w = 2

5) for all pairs {r, s} output by block-merge procedure do

Emit <f(r, s), {r, s}>

6) end for


about Algorithm 3:

Theorem 5. The approximation ratio of Algorithm 3 is
exactly the approximation ratio of MAX TOUR-TSP.

Proof: In Appendix. ■

The run-time of the multi-pass procedure in Algorithm 2 is
c times the run-time of Algorithm 3 if all the blocking keys
have the same BKV distributions. This is unlikely, since one
of the strengths of the multi-pass procedure is to accommodate
diverse blocking keys. Nevertheless, assuming that BKVs
generated by the blocking keys in the pairs in C individually
obey either the uniform or Zipf distribution, the run-time of
Algorithm 2 is simply a weighted sum (with weights adding
up to c) of the appropriate Algorithm 3 run-times. If other
distributions (not considered in this paper) are accommodated,
the analysis can be extended in a straightforward manner. We
can also state the following corollary:

Corollary 3. The approximation ratio of Algorithm 2 is
exactly the approximation ratio of MAX TOUR-TSP.

The proof of the corollary is self-evident when we consider
the pseudocode of Algorithm 2 and Theorem 5.

B. MapReduce-based Traditional Blocking 

The pseudocode for MapReduce-based traditional blocking
is shown in Algorithm 4. In the mapper, the blocking key b is
applied on each record and the key-value pair <b(r), r> is
emitted. This implies that, after the shuffling step, all records
with the same BKV end up in the same reducer. Thus, each
reducer instance processes a single block. Steps 1-5 in each
reducer instance mirror the for loop in line 8 of Algorithm 3.
Finally, the block-merge procedure, which we described earlier
as sliding a constant-sized window w over the records in the
block and generating the candidate set thereof, is run on each
list output by the TSP subroutine. As in the rest of this section,
w = 2.
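The map, shuffle and reduce phases can be simulated in memory to see how the pieces fit together. This is a single-process sketch, not an implementation for a MapReduce framework; `order_block` is a hypothetical exact stand-in (exhaustive search over orderings) for the RecordsToGraph, MAX TOUR-TSP and TourToList steps, and the records, blocking key and scoring heuristic are illustrative.

```python
from collections import defaultdict
from itertools import permutations


def map_phase(records, b):
    """Mapper: emit <b(r), r> for every record."""
    for r in records:
        yield b(r), r


def shuffle(pairs):
    """Group values by key so that each block reaches a single reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def order_block(block, f):
    """Exact stand-in for the TSP-based ordering; tiny blocks only."""
    return max(permutations(block),
               key=lambda p: sum(f(x, y) for x, y in zip(p, p[1:])))


def reduce_phase(block, f):
    """Reducer: order one block, block-merge with w = 2, emit scored pairs."""
    ordered = order_block(block, f)
    for r, s in zip(ordered, ordered[1:]):
        yield f(r, s), frozenset((r, s))


def run(records, b, f):
    output = []
    for block in shuffle(map_phase(records, b)).values():
        output.extend(reduce_phase(block, f))
    return output


# Hypothetical integer records, blocked by parity, scored by closeness.
output = run([1, 2, 3, 4, 5, 6], lambda r: r % 2, lambda r, s: 10 - abs(r - s))
```

Each reducer sees exactly one block, so no emitted pair straddles two blocks.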

Given that Algorithm 4 is designed for traditional blocking
and not Sorted Neighborhood, it does not make any difference
whether f is local or global. Note that the proof of Theorem
5 can be used, with trivial modifications, to prove that the
approximation ratio for Algorithm 4 is exactly that of MAX
TOUR-TSP. Finally, the reducer emits key-value pairs of form
<f(r, s), {r, s}>.

A rigorous analysis of Algorithm 4 is not possible without
making some assumptions about the available number of
reducer instances, as well as the load-balancing strategies of
the namenode.$^{12}$ For the sake of analysis, assume that the set R
of records is sharded equally on $h_m$ map nodes. Each mapper
instance then takes time $O(t(b)|R|/h_m)$. Using notation from
the previous sub-section, assume that u distinct BKVs (and
hence, blocks) are generated.

If we now assume that at least u reducer instances are
available, and the namenode distributes the load equally, the
run-time of the reduce phase will be dominated by the largest
block generated, since this will be the last reducer to terminate.
If a uniform distribution on BKVs is assumed, all blocks are of
equal size and contain $|R|/u$ records. Using the analysis from
the previous sub-section, the total time taken by Algorithm 4
is then $O(t(b)|R|/h_m + (|R|/u)^2 t(f) + (|R|/u)^q + |R|/u)$. The
last term is due to the block-merge procedure and is asymptotically
subsumed. The final expression is
$O(t(b)|R|/h_m + (|R|/u)^2 t(f) + (|R|/u)^q)$.

Assuming Zipf distribution, the largest block size is $|R|/H_u$
with $H_u$ defined earlier as the partial u-term harmonic sum.
The run-time of Algorithm 4 with Zipf distribution of BKVs is
$O(t(b)|R|/h_m + (|R|/H_u)^2 t(f) + (|R|/H_u)^q + |R|/H_u)$
$= O(t(b)|R|/h_m + (|R|/H_u)^2 t(f) + (|R|/H_u)^q)$.


C. Multi-pass Sorted Neighborhood with global scoring 
heuristics 


Unfortunately, there is no evidence yet that an approximation
algorithm with constant approximation ratio exists for
Sorted Neighborhood with a global scoring heuristic. Given
the similarity of the problem to generalized TSP [56], it could
be the case that no such algorithm exists unless P = NP. We
pose this as a conjecture in Section VI.

One possible option is to use Algorithm 3 but ignore the
fact that f is not local. Theorem 5 would no longer apply,
but it is reasonable to assume the algorithm will still perform
well empirically. The question then is if we can optimize the
algorithm further, given that we know f is global and not local.

Algorithm 5 shows the pseudocode for a single-pass Sorted
Neighborhood procedure that attempts two optimizations. The
algorithm assumes a global scoring heuristic. Note that the
only difference in the multi-pass procedure in Algorithm 2
for the case of global f is that it would invoke Algorithm 5
instead of Algorithm 3 in line 2.

12 This is the master node that dynamically controls the MapReduce
workflow [13]

The first optimization is that a modified version of Algorithm 3
is run to obtain the final list $R^l$. The modification
relates to the list polarities of each of the lists returned by
the TourToList subroutine. Recall that the subroutine has two
list choices (denoted as polarities) for each Hamiltonian circuit
output by MAX TOUR-TSP. The list polarity did not matter if
f was local, but because a global f implies that record pairs
straddling blocks can have non-zero scores, the polarity of
each list matters. For the first block, let TourToList randomly
return one of the two choices. Next, assuming u blocks, let
the i-th block (i ranging from 1 to u - 1) have record r at the
end. Let the (i + 1)-th block have endpoint$^{13}$ records $s_1$ and $s_2$.
If $f(s_1, r) > f(s_2, r)$, TourToList returns the list $(s_1, \ldots, s_2)$,
otherwise it returns the reversed list.
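The polarity rule can be sketched as follows. The function names and the score table are hypothetical, every block is assumed to be non-empty, and the first block is kept in the polarity it is given (where the text above chooses it at random).

```python
def choose_polarity(block_list, r, f):
    """Return the block list with the better-matching endpoint placed first.

    r is the last record of the preceding block; s1 and s2 are the two
    endpoint records of the current block's list.
    """
    s1, s2 = block_list[0], block_list[-1]
    if f(s1, r) > f(s2, r):
        return list(block_list)
    return list(reversed(block_list))


def stack_blocks(block_lists, f):
    """Concatenate per-block lists, fixing each polarity from left to right."""
    ordered = list(block_lists[0])   # first block: polarity taken as given
    for block_list in block_lists[1:]:
        ordered.extend(choose_polarity(block_list, ordered[-1], f))
    return ordered


# Hypothetical global scores for pairs straddling the two blocks.
SCORES = {(4, 5): 1, (4, 6): 3}
f = lambda a, b: SCORES.get((min(a, b), max(a, b)), 0)

stacked = stack_blocks([[1, 2, 3, 4], [5, 6]], f)
```

Only two score evaluations are needed per block boundary, which matches the constant polarity cost assumed in the run-time analysis.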

With this modification in place, Algorithm 3 is run from
lines 1-9, and a second optimization called greedy adjacent
swapping is conducted on the list $R^l$. To understand the
optimization, consider the i-th block and the (i + 1)-th block
(i ranging from 1 to u - 1), and let the first record of the
(i + 1)-th block be s and the last record of the i-th block be r.
Let the record r' in the i-th block have highest score (according
to scoring heuristic f) when paired with s, compared to all
other records in the i-th block. If r ≠ r', swap r and r' if the
resulting increase in score ($f(r', s) - f(r, s)$) due to the swap
is greater than the (possible$^{14}$) loss in the local 2-score of the
i-th block. In the forward pass, i ranges from 1 to u - 1 (the first
to the penultimate block). The backward pass is similar but
starts from Block u and traverses upwards through the
list to Block 2. Each of these two passes yields two different
lists $R^l_f$ and $R^l_b$ with their own w-scores. Of the three lists,
$R^l$, $R^l_f$ and $R^l_b$, the list with the highest 2-score is output (line 7).
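A sketch of the forward pass of greedy adjacent swapping is given below, under stated assumptions: blocks are held as separate lists, the loss is measured as the drop in the block's own 2-score, and the score table is hypothetical. The backward pass is not shown.

```python
def two_score(lst, f):
    """2-score of a list: sum of f over adjacent records."""
    return sum(f(a, b) for a, b in zip(lst, lst[1:]))


def forward_swap(blocks, f):
    """Forward pass: for each block i, try to move its best match for the
    first record s of block i + 1 into the last position of block i."""
    blocks = [list(b) for b in blocks]
    for i in range(len(blocks) - 1):
        cur, s = blocks[i], blocks[i + 1][0]
        j = max(range(len(cur)), key=lambda k: f(cur[k], s))  # index of r'
        if j == len(cur) - 1:
            continue                                          # r is already r'
        swapped = cur[:]
        swapped[j], swapped[-1] = swapped[-1], swapped[j]
        gain = f(swapped[-1], s) - f(cur[-1], s)              # f(r', s) - f(r, s)
        loss = two_score(cur, f) - two_score(swapped, f)      # local 2-score drop
        if gain > loss:
            blocks[i] = swapped
    return blocks


# Hypothetical global scoring heuristic; unlisted pairs score 0.
SCORES = {(1, 2): 5, (2, 3): 5, (3, 4): 5, (2, 4): 4,
          (5, 6): 2, (4, 5): 1, (4, 6): 3, (3, 5): 9}
f = lambda a, b: SCORES.get((min(a, b), max(a, b)), 0)

original = [[1, 2, 3, 4], [5, 6]]
swapped_blocks = forward_swap(original, f)
flatten = lambda blocks: [r for b in blocks for r in b]
before = two_score(flatten(original), f)
after = two_score(flatten(swapped_blocks), f)
```

Here the swap gains 8 at the boundary for a local loss of 1. Because a swap can also move a block's first record and so change the score at the preceding boundary, the overall 2-score is not guaranteed to improve in general.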

The reason why all three lists must be compared is because
greedy adjacent swapping can theoretically lead to a decline
in the original w-score. Figure 7 in the Appendix proves this
through a construction.

With uniform BKV distribution, each swap takes time
$O(|R|/u)$ since $|R|/u$ records in block i must be compared
with the first record in the next block (assuming forward pass).
Since i ranges from 1 to u - 1, the forward pass takes time
$O(t(f)|R|(u - 1)/u)$ over the time taken by Algorithm 3.
Determining list polarity takes time $O(t(f))$ since only two
comparisons are required; we assume that it is subsumed
by the swapping procedure. Similarly, an upper bound of
$O(t(f)|R|(u - 1)/H_u)$ should be added to the run-time of
Algorithm 3 if a Zipf distribution of BKVs is assumed.

We note that Algorithm 5 is amenable to numerous practical
optimizations, including caching of scores$^{15}$ and a multi-threaded
implementation for each of the forward and backward
passes. We leave investigating and evaluating such optimizations
for future work.

D. Practical Usage 

Table II lists the three proposed algorithms with run-times
for both uniform and Zipf distributions. As noted earlier, the
run-time of the multi-pass procedure may be calculated by
weighting, if every blocking key either follows a uniform or
Zipf distribution. If another distribution is expected, a similar
analysis would first have to be carried out for Algorithm 3
and the weighting procedure extended appropriately. Finally,

13 Technically, the records corresponding to the vertices preceding and
following the dummy vertex in the returned Hamiltonian circuit

14 Because of the approximate nature of the solutions, there is always a 
small chance that the swap will end up increasing the 2-score of that block, 
which gives us all the more reason to perform the swap 

15 Since it is quite conceivable that many pairs will end up getting scored 
more than once, as the algorithm evaluates greedy adjacent swapping 





Algorithm 5 Single-pass Sorted Neighborhood, global f

Input :

• Set R of records

• Global scoring heuristic f

• Blocking key b

Output :

• Candidate set of pairs Γ

Method :

1) Run modified Algorithm 3 from lines 1 to 9 with same
inputs, to get list R^l

2) Run merge procedure on R^l to get Γ_1 with score F_1

3) Perform greedy adjacent swapping in a forward pass
on R^l to get R^l_f

4) Run merge procedure on R^l_f to get Γ_2 with score F_2

5) Perform greedy adjacent swapping in a backward pass
on R^l to get R^l_b

6) Run merge procedure on R^l_b to get Γ_3 with score F_3

7) Output as Γ the highest-scoring of Γ_1, Γ_2 and Γ_3 with
ties broken in that order


note that although we implicitly assume that I/O or shuffling 
(in the case of Algorithm Bl costs are subsumed in the derived 
O expression, these costs could be prohibitive for specific 
databases or implementations and must be separately derived, 
if this is the case. In the original papers, I/O costs of Sorted 
Neighborhood were experimentally found not to dominate R), 

0 - 

In the context of the full record linkage procedure, a blocking
method in Table II should only be selected if its run-time
plus $O(t(g)|R|)$ has a strict upper bound $o(t(g)|R|^2)$, where $g$
is the sophisticated similarity function used in the second step.
In the last decade, with machine learning procedures, genetic
algorithms and expressive feature spaces dominating the
state-of-the-art [22], [8], $g$ has become increasingly expensive. We
hypothesize that, with efficient, practical implementations of
the proposed algorithms and careful selection of blocking keys
and heuristics, the methods will prove to be qualitatively and
computationally viable. As an additional advantage, improvements
in max TSP will contribute directly to the quality of the
proposed algorithms.
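As a rough illustration of how steps 2 to 7 of Algorithm 5 fit together, the following Python sketch merges three lists and keeps the highest-scoring candidate set. It rests on several assumptions that are not taken from the paper: the candidate set score is taken to be the sum of a symmetric pairwise heuristic `f`, and the two swapping passes are supplied by the caller as the hypothetical callables `forward_gas` and `backward_gas` rather than implemented here.

```python
from itertools import combinations


def merge(records, w):
    """Pair every two records that share some window of size w."""
    pairs = set()
    for start in range(max(1, len(records) - w + 1)):
        window = records[start:start + w]
        pairs.update(frozenset(p) for p in combinations(window, 2))
    return pairs


def candidate_score(pairs, f):
    """Assumed scoring: sum of a symmetric f over all pairs in the set."""
    return sum(f(*tuple(p)) for p in pairs)


def single_pass_global(r_list, w, f, forward_gas, backward_gas):
    """Steps 2-7: merge the three lists and return the best (pairs, score)."""
    lists = [r_list, forward_gas(r_list), backward_gas(r_list)]
    best = None
    for lst in lists:
        pairs = merge(lst, w)
        s = candidate_score(pairs, f)
        # Strict '>' keeps the earlier list on ties (order: original,
        # forward pass, backward pass).
        if best is None or s > best[1]:
            best = (pairs, s)
    return best
```

Keeping the swapping passes as parameters makes the selection logic testable on its own; any concrete swapping routine can be plugged in without changing the comparison in step 7.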

VI. Conjectures and Conclusion 

This paper shows that devising a maximum-performing
Sorted Neighborhood algorithm entails solving an NP-complete
w-ordering problem. There is a close connection
between 2-ordering and TSP. This connection is used to define
and analyze three approximation algorithms for the special
but practically important case of w = 2. In the future, we
will implement and evaluate these algorithms and attempt to
devise experimentally viable solutions for w > 2. We state the
following conjectures:

• For an arbitrary global heuristic $f$, no polynomial-time
$\rho$-approximation algorithm can exist for approximating
max SN unless P = NP.

• For w > 2 and an arbitrary local $f$, any polynomial-time
(in $|R|$) $\rho$-approximation algorithm is exponential
in $w - 1$.

Proving (or disproving) these conjectures will have direct 
ramifications on the approximability of the generic w-ordering 
problem. 
Appendix

A. Proof of Theorem 1

Recall that there is a list of $u$ blocks, $\langle R_{y_1}, \ldots, R_{y_u} \rangle$,
where each block is a set of records. We need to show that
if the scoring heuristic $f$ is local, then having each block
maximum-score w-ordered is both necessary and sufficient
for max SN.

From Section IV, the w-score of any such list is the score
of the candidate set $\Gamma$ generated after the list is subjected to the
w-window merge step. Per Definition 4, max SN will always
perform merge on a list that guarantees maximum w-score,
compared to any other list that obeys SN semantics. Let such
a list, $R^l$, be denoted as the max list. We assume in this proof
that the total number of records in $R$ is strictly greater than
w, since otherwise, every list would yield the same w-score
and max SN would be trivially achieved.

When the merge step commences on the max list, all records
within a window of size w are paired and added to the
candidate set $\Gamma$. There is an alternate way of characterizing
$\Gamma$. Specifically, partition $\Gamma$ into a set of (at most) $2u$ sets, two
for each block. Let the $i$-th block contribute two disjoint sets
$\Gamma_i$ and $\Gamma_i^0$. The reason for the 0 superscript will become clear
shortly.

The partition is achieved by the following construction:

1) If both records in a pair $\{r, s\} \in \Gamma$ are from the same
block $R_{y_i}$, place $\{r, s\}$ in set $\Gamma_i$.

2) If the records in a pair $\{r, s\} \in \Gamma$ are from different
blocks (say $r \in R_{y_i}$ and $s \in R_{y_j}$), then add $\{r, s\}$ to
$\Gamma_i^0$ if $i < j$; otherwise add the pair to $\Gamma_j^0$.

Some of the sets may be empty; we remove them from the
partition if so, and thereby fulfill the conditions of a partition.
Using the partition:

$$\Gamma = \bigcup_{i=1}^{u} \Gamma_i \;\cup\; \bigcup_{i=1}^{u} \Gamma_i^0 \qquad (2)$$

By the definition of the local scoring heuristic, the score of each
pair in $\Gamma_i^0$ is 0 (hence, the superscript). Thus, the score of $\Gamma$ is
only contributed to by the first term in Equation 2. Since the
sets operated upon by the union are pairwise disjoint, the following
equation is directly derived:

$$\mathrm{score}(\Gamma) = \sum_{i=1}^{u} \mathrm{score}(\Gamma_i) \qquad (3)$$

Again, because of disjointness, taking the max on both sides
allows us to move the max inside the summation in Equation
3. By the semantics of Algorithm 1 and the definition of
maximum-score w-ordering, $\max \mathrm{score}(\Gamma_i)$ exactly equals the
maximum w-score achievable over all orderings of the block
$R_{y_i}$. Thus, the condition that every block is independently
maximum-score w-ordered is a sufficient one for maximizing
the score of $\Gamma$.
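The decomposition in Equation 3 can be checked numerically on small inputs. The sketch below is illustrative only: it assumes the candidate set score is the sum of a symmetric pairwise heuristic, models locality by zeroing the score of pairs from different blocks, and uses hypothetical helper names (`merge`, `w_score`, `check_block_decomposition`) that do not appear in the paper.

```python
from itertools import combinations


def merge(records, w):
    """Pair every two records that share some window of size w."""
    pairs = set()
    for start in range(max(1, len(records) - w + 1)):
        window = records[start:start + w]
        pairs.update(frozenset(p) for p in combinations(window, 2))
    return pairs


def w_score(records, w, f):
    """Sum of f over all pairs produced by the w-window merge."""
    return sum(f(*tuple(p)) for p in merge(records, w))


def check_block_decomposition(blocks, w, f):
    """Return (score of the full list under a local f, sum of block scores)."""
    block_of = {r: i for i, blk in enumerate(blocks) for r in blk}

    def local_f(a, b):
        # A local heuristic scores pairs from different blocks as 0.
        return f(a, b) if block_of[a] == block_of[b] else 0

    full_list = [r for blk in blocks for r in blk]
    lhs = w_score(full_list, w, local_f)              # score of the whole set
    rhs = sum(w_score(blk, w, f) for blk in blocks)   # sum over the blocks
    return lhs, rhs
```

Because the records of a block are contiguous in the concatenated list, two records of the same block share a window in the full list exactly when they share one in the block's own list, which is why the two quantities agree.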

We prove necessity by contradiction. Suppose some block
$R_{y_i}$ is not maximum-score w-ordered. Let its list version (of
which the w-score will not be maximized per Algorithm 1)
be $R^l_{y_i}$. Let the candidate set generated as a result be $\Gamma'$,
and assume $\Gamma$ to be the generated candidate set if $R_{y_i}$ were
maximum-score w-ordered. We assume $\Gamma'$ was generated from
a max list and show that this leads to a contradiction. Using
Equation 3 and the construction of the partition, the difference
between the scores of $\Gamma$ and $\Gamma'$ will be $\mathrm{score}(\Gamma_i) - \mathrm{score}(\Gamma'_i)$.
Since $R_{y_i}$ was maximum-score w-ordered in the list that
generated $\Gamma$ but not $\Gamma'$, the difference is positive. If we
replace list $R^l_{y_i}$ with another list that achieves higher w-score,
the score of $\Gamma'$ will improve. This proves the
contradiction, since $\Gamma'$ was clearly not generated from a max
list.

Thus, if any block is non-maximum-score w-ordered, the
resulting global list of records will not be a max list. The
contrapositive of the statement proves necessity. Coupled with
the proof of sufficiency, the full statement of the theorem is
proved.

B. Proof of Theorem [i] 

Theorem 2 showed that maximum-score 2-ordering was NP-complete;
hence, it suffices to show a Karp reduction from the
decision version of maximum-score 2-ordering on a set $R$ of
records. Let $|R| = m$; $R$ consists of records $\{r_1, \ldots, r_m\}$. We
also assume that $m \gg w$.

We begin by constructing a new set $S$ of $m(w - 1)$ records,
by mapping each record $r_i \in R$ to a set $S_i$, which contains
$w - 1$ records. Let $S_i$ be denoted as an internal set and $i$ as
the set ID of $S_i$. Also, let each record in this set be denoted
as $s_i^j$, where $j \in \{1, 2, \ldots, w - 1\}$ is the internal ID of
the record. $S$ is simply the union of all the $m$ internal sets
that the records were mapped to.

Technically, such a construction can be achieved by assigning
$S$ a schema containing just two attributes (called set ID and
internal ID). Each record in $S$ can be uniquely identified by
employing both IDs in conjunction; hence, both IDs together
constitute a compound primary key.

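We obtain a new scoring heuristic $f'$ for record pairs in $S$
using the following construction:

1) $f'(s_i^c, s_i^d) = \frac{1}{2} \sum_{u=1}^{m} \sum_{v=1}^{m} f(r_u, r_v)$, $\forall i \in \{1, \ldots, m\}$
and $\forall c, d \in \{1, \ldots, w - 1\}$

2) $f'(s_i^c, s_j^c) = f(r_i, r_j)$, $\forall i, j \in \{1, \ldots, m\}$, $i \neq j$ and
$\forall c \in \{1, \ldots, w - 1\}$

3) $f' = 0$ for all other record pairs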

Intuitively, Rule 1 uniformly assigns the same large score
to the record pairs that lie within the same internal set,
independent of the specific internal or set IDs of the records
(see footnote 16). If we visualize the scoring heuristic $f$ as a
discrete $m \times m$ matrix (but with diagonal entries undefined; see
the definition of $f$), the quantity on the right-hand side is simply
the sum of all (defined) matrix entries. The factor of 1/2 is to
prevent each entry from being counted twice, due to the symmetry
of the two summations.

Rule 2 shows that a non-zero score can only exist between
the $c$-th records of two different internal sets (with $c$ ranging
from 1 to $w - 1$) and is equal to the score between the original
records that were mapped to the two sets in the construction.
Note that the score itself is independent of the value of $c$. Rule
3 assigns every other record pair in the scaled problem a score
of zero.

Thus, Rule 1 applies when two records belong to the same 
internal set (and have the same set ID), while Rule 2 applies 
when two records have different set IDs but the same internal 
ID. Otherwise, Rule 3 applies. 

The computation of $f'$ occurs in polynomial time since $w$
is a constant and computing pairwise scores for $m(w - 1)$
records is at most $O(t(f)\, m^2 (w - 1)^2)$. By definition, $t(f)$ is
polynomial in $m$ and $w - 1$ is constant.
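Rules 1 to 3 translate directly into code. The following is a minimal sketch, assuming records of $S$ are represented as `(set_id, internal_id)` tuples and that the original heuristic is a symmetric Python callable `f(u, v)` over record indices 1 to m; the name `build_f_prime` is hypothetical and introduced only for illustration.

```python
def build_f_prime(f, m):
    """Return f' over records (set_id, internal_id), given f over 1..m."""
    # Half of the double sum in Rule 1 equals the sum over unordered
    # pairs u < v, because f is symmetric and the diagonal is undefined.
    big = sum(f(u, v) for u in range(1, m + 1) for v in range(u + 1, m + 1))

    def f_prime(s, t):
        if s == t:
            raise ValueError("f' is only defined for pairs of distinct records")
        (i, c), (j, d) = s, t
        if i == j:
            return big       # Rule 1: same internal set
        if c == d:
            return f(i, j)   # Rule 2: different sets, same internal ID
        return 0             # Rule 3: every other pair

    return f_prime
```

For example, with `f = lambda u, v: u * v` and `m = 3`, every pair inside one internal set scores 2 + 3 + 6 = 11, while the pair `((1, 2), (3, 2))` scores `f(1, 3) = 3`.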

Finally, recall that the decision version of maximum-score 
2-ordering accepted as input a decision constant k. A valid 
solution must return True if a list with 2-score at least k exists, 

16 This is evident when we consider that the variables on the left hand side 
of Rule 1 are independent of the variables on the right hand side 
Fig. 4. An illustration of three concatenated sub-lists, with w = 4. Looking at
the window, the only record pair from different internal sets that can contribute 
a non-zero score is 


where $k$ is a positive integer. Construct the decision constant $k'$
of the transformed problem instance using the equation below:

$$k' = T_1 + T_2 \qquad (4)$$

Here, $T_1$ and $T_2$ are given by the equations below:

$$T_1 = m \binom{w-1}{2} \cdot \frac{1}{2} \sum_{u=1}^{m} \sum_{v=1}^{m} f(r_u, r_v) \qquad (5)$$

$$T_2 = k(w - 1) \qquad (6)$$

We perform maximum-score w-ordering on this transformed
problem instance, with the inputs being the constructed set of
records $S$, the scoring heuristic $f'$ and the decision constant
$k'$. We claim that the (True or False) output of the oracle
solving the transformed problem is exactly the output of the
original problem instance. In other words, we claim a correct
Karp reduction.

We start by showing the correctness of the Karp reduction
for the False output. We prove by contradiction that if the
oracle returns False, then it cannot be the case that an ordering
of the original set $R$ of records exists with 2-score at least $k$,
assuming the scoring heuristic $f$.

Suppose not. That is, the oracle returned False but there is
some ordering $R^l$ that has 2-score at least $k$. Without loss of
generality, let $R^l = \langle r_1, \ldots, r_m \rangle$. Since this list has 2-score
at least $k$, the following is true:

$$\sum_{i=1}^{m-1} f(r_i, r_{i+1}) \ge k \qquad (7)$$

Given this information, consider (in the transformed problem)
the list $S^l$ formed by concatenating the sub-lists $S_1^l, \ldots, S_m^l$,
where each $S_i^l$ is simply $\langle s_i^1, \ldots, s_i^{w-1} \rangle$ for all $i$ ranging
over set IDs (that is, from 1 to $m$). Let us calculate the w-score
of this list.

Since the window has size $w$ while a sub-list has $w - 1$
records, it will be the case that each sub-list $S_i^l$ will fall entirely
within some window, and therefore, all records within a sub-list
will be paired and added to the (transformed problem)
candidate set $\Gamma'$. Therefore, each sub-list, by itself, contributes
exactly $\binom{w-1}{2}$ pairs. Given Rule 1 in the construction of $f'$ and
that there are exactly $m$ sub-lists, the expression $T_1$ (Equation
5) will be the score of all record pairs in $\Gamma'$ such that both
records in the pair are from the same sub-list.

Also, because of the way each sub-list is defined, it must
be the case that in every window, exactly one pair of the
form $\{s_i^c, s_{i+1}^c\}$ can contribute a non-zero score, among record
pairs where both records are from different sub-lists. Consider
Figure 4 for an illustration, with $m = 3$ and $w = 4$. Since $c$
ranges from 1 to $w - 1$ and $i$ ranges from 1 to $m - 1$, such
Fig. 5. An illustration of an interleaved list. Intuitively, the list is interleaved
because records and sf are not ‘lined up’ with other records from their 
respective internal sets 
Fig. 6. An example of a list that is stacked but unaligned, since the order
of internal IDs in the first sub-list is 1,2,3 but in the second sub-list is 2,1,3 
(and in the third, 1,3,2). The list would be unaligned even if just one of its 
sub-lists were ‘out of alignment’ 


pairs will contribute total score (using Rule 2):

$$T'_2 = (w - 1) \sum_{i=1}^{m-1} f(r_i, r_{i+1}) \qquad (8)$$

By Equations 7 and 8, $T'_2 \ge k(w - 1) = T_2$. But this implies that the
w-score of the constructed list $S^l$ is at least $T_1 + T_2$, which implies
that the w-score is at least $k'$, by Equation 4. This leads to
a contradiction, since the oracle returned False. Thus, it must
be the case that if a list with 2-score at least $k$ exists in the
original instance, a list with w-score at least $k'$ exists in the
transformed instance. By the contrapositive, the reduction is
correct, assuming the oracle returns False.
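The score calculation used in this direction of the proof can be checked by brute force on small instances: for every ordering of the original records, the w-score of the corresponding concatenated sub-lists should equal $T_1$ plus $(w - 1)$ times the 2-score. The sketch below is only a sanity check under stated assumptions: records of $S$ are `(set_id, internal_id)` tuples, `f` is a symmetric callable over indices 1 to m, the score of a candidate set is the sum of pair scores, and all function names are hypothetical.

```python
from itertools import combinations, permutations
from math import comb


def w_score(records, w, f):
    """Sum of f over all record pairs that share a window of size w."""
    pairs = set()
    for start in range(max(1, len(records) - w + 1)):
        pairs.update(frozenset(p)
                     for p in combinations(records[start:start + w], 2))
    return sum(f(*tuple(p)) for p in pairs)


def stacked_aligned_scores(f, m, w):
    """For each ordering of 1..m, return (w-score of S^l, T1 + (w-1)*2-score)."""
    big = sum(f(u, v) for u in range(1, m + 1) for v in range(u + 1, m + 1))

    def f_prime(s, t):
        (i, c), (j, d) = s, t
        if i == j:
            return big                       # Rule 1
        return f(i, j) if c == d else 0      # Rules 2 and 3

    t1 = m * comb(w - 1, 2) * big            # Equation 5
    results = []
    for order in permutations(range(1, m + 1)):
        two_score = sum(f(order[i], order[i + 1]) for i in range(m - 1))
        stacked = [(i, c) for i in order for c in range(1, w)]
        results.append((w_score(stacked, w, f_prime),
                        t1 + (w - 1) * two_score))
    return results
```

The check enumerates all m! orderings, so it is only practical for very small m; it verifies the arithmetic of the construction, not the reduction as a whole.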

In order to show the correctness of the oracle for a True
output, we introduce some additional terminology. Define a list
$S^l$ to be an interleaved list if there exist distinct records $s_i^c$ and
$s_i^d$ in the list, such that the list contains, between these records,
at least one record of the form $s_j^e$, where $i, j \in \{1, \ldots, m\}$,
$c, d, e \in \{1, \ldots, w - 1\}$ and $i \neq j$. Let a list that is not
interleaved be defined as a stacked list. For example, the list
in Figure 4 is stacked. The list in Figure 5 is interleaved.

Intuitively, a list is interleaved if records from some internal
set $S_i$ are not all lined up against one another. Place each of the
$|S|!$ orderings of $S$ into either the set of interleaved orderings
$\mathcal{I}^l$ or of stacked orderings $\mathcal{S}^l$. Together, $\mathcal{I}^l$ and $\mathcal{S}^l$ form
a partition of the set of all orderings. Note that a stacked
ordering is simply the concatenation of sub-lists, where each
sub-list is of the form $S_i^l$, one of $(w - 1)!$ possible orderings of
internal set $S_i$.

Furthermore, define a stacked ordering to be aligned if the
internal ID of the $c$-th record in every sub-list (with $c$ ranging
from 1 to $w - 1$) is the same. If a stacked ordering is not
aligned, let it be denoted as unaligned. Figure 4 is an example
of a stacked aligned ordering, while Figure 6 is an example
of a stacked unaligned ordering. The set $\mathcal{S}^l$ is then partitioned
into the sets $\mathcal{S}^l_a$ and $\mathcal{S}^l_u$ of stacked aligned and stacked
unaligned orderings respectively.

Define the alignment function to be a function that takes
a stacked unaligned ordering as input and aligns it by using
a particular sub-list as a pivot. Thus, the output is a stacked
aligned ordering. For example, Figure 4 is the alignment of
Figure 6 if we use the first sub-list as the pivot. Specifically,
the function rearranges the records in every sub-list except
the pivot sub-list, such that the order of internal IDs in every
sub-list now reflects the order of internal IDs in the pivot
sub-list. Given that there can be at most $m$ distinct pivot sub-lists
for an unaligned ordering, the alignment function can yield
at most $m$ aligned orderings for a given input. We call this
set of (at most $m$) possible alignments (for a given unaligned
ordering $S^l_u$) the alignment set of $S^l_u$.
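The definitions of interleaved, stacked, aligned and the alignment function can be made concrete with a short sketch. It assumes each record is a `(set_id, internal_id)` tuple and that every internal set contains the same internal IDs; the function names are hypothetical and chosen only for this illustration.

```python
def is_interleaved(lst):
    """True if some internal set's records are not contiguous in the list."""
    closed = set()   # set IDs whose run of records has already ended
    prev = None
    for set_id, _ in lst:
        if set_id != prev:
            if set_id in closed:
                return True          # this set ID reappears after a gap
            if prev is not None:
                closed.add(prev)
            prev = set_id
    return False


def sub_lists(stacked):
    """Split a stacked list into its consecutive per-set sub-lists."""
    groups = []
    for rec in stacked:
        if groups and groups[-1][0][0] == rec[0]:
            groups[-1].append(rec)
        else:
            groups.append([rec])
    return groups


def is_aligned(stacked):
    """True if every sub-list lists internal IDs in the same order."""
    groups = sub_lists(stacked)
    pattern = [c for _, c in groups[0]]
    return all([c for _, c in g] == pattern for g in groups)


def align(stacked, pivot):
    """Reorder every sub-list to follow the internal-ID order of the pivot."""
    groups = sub_lists(stacked)
    pattern = [c for _, c in groups[pivot]]
    return [(g[0][0], c) for g in groups for c in pattern]
```

Applied to the internal-ID orders described for Figure 6 (1,2,3 then 2,1,3 then 1,3,2), `align(..., 0)` returns the list with order 1,2,3 in every sub-list, and varying `pivot` over the sub-lists enumerates the alignment set.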

Using the concepts defined above, we state the following
properties:

Property 1. $\forall I^l \in \mathcal{I}^l, \forall S^l \in \mathcal{S}^l$: w-score($I^l$) $\le$ w-score($S^l$).

Property 2. $\forall S^l_u \in \mathcal{S}^l_u$, the w-score of $S^l_u$ will be no more
than the w-score of any list in the alignment set of $S^l_u$.

We do not provide technical proofs of these properties, but
they may be proved by contradiction, by calling on Rules 1
and 2 in the construction of $f'$. Specifically, if the first property
is false, it can be shown that Rule 1 was incorrectly applied,
while if the second property is false, Rule 2 was incorrectly
applied.

Intuitively, the first property holds because the score assigned
to two records in the same internal set is 'too large'
for Rule 2 to compensate for it (see footnote 17). It is always the
case, in an interleaved list, that at least two records sharing the
same set ID will not share a window at all. The second property holds
for a similar reason (but with relation to Rules 2 and 3), when
comparing a stacked unaligned ordering to its aligned version.

We prove the correctness of the oracle for True outputs by
first observing that every stacked ordering (whether aligned or
unaligned) is guaranteed to have w-score at least $T_1$ (Equation
5), by an earlier part of the proof.

If an ordering with w-score at least $k'$ is interleaved, then
every stacked ordering also has score at least $k'$ (Property
1). Consider a specific stacked aligned ordering that is the
concatenation of sub-lists $S_1^l, \ldots, S_m^l$ and with records in each
sub-list $S_i^l$ ordered as $\langle s_i^1, \ldots, s_i^{w-1} \rangle$. We can follow
the proof showing correctness of the oracle for False outputs
backwards to derive Equation 7, which in turn implies the
existence of a 2-ordering with score at least $k$.

The same proof can be employed, with only a change in
notation, if the ordering is not interleaved but is stacked and
aligned. If the ordering is stacked and unaligned, we pick an
element from the ordering's alignment set (which has a w-score
at least as high, by Property 2) and conduct the analysis
in the previous sentence. In either case, Equation 7 will be the
end result.

We conclude that the Karp reduction is correct. This proves 
the theorem that maximum-score w-ordering for any constant 
w > 2 is NP-complete. 

C. Proof of Theorem [5] 

By Theorem 1, we know that maximum-score 2-ordering
each block individually is necessary and sufficient for achieving
max SN, since Algorithm 3 assumes a local $f$. Suppose
the w-score of the max list (the list on which max SN runs
the merge step) is $\Phi^*$. Before performing 2-ordering, we have
a list of $u$ blocks, $\langle R_{y_1}, \ldots, R_{y_u} \rangle$, where each block is
a set of records. Let the maximum score of a block $y_i$ be
$\Phi^*[y_i]$, $i \in \{1, \ldots, u\}$. By Theorem 1 and because $f$ is local,
we have:

$$\Phi^* = \sum_{i=1}^{u} \Phi^*[y_i] \qquad (9)$$

If we can prove an approximation ratio $\rho$ for each $\Phi^*[y_i]$,
where $\rho$ is the approximation ratio of MAX TOUR-TSP, then
Equation 9 will prove the theorem.


Fig. 7. A three-block construction (Block 1, Block 2, Block 3) showing that
greedy adjacent swapping (GAS) can potentially lead to a decline in score.
w = 2 is assumed, and it can be verified (using Figure 8) that each block on
the left-hand side is maximum-score 2-ordered

Fig. 8. The score matrix used in the construction in Figure 7, with panels for
Block 1, Block 2, Block 3 and inter-block scores. Any scores not in the matrix
evaluate to 0


We drop the second subscript and consider an arbitrary
block $R_y$ with $|R_y| = m$ records. In line 8 of Algorithm
3, the auxiliary subroutine RecordsToGraph converts $R_y$ into
a weighted, complete, undirected graph (with an extra dummy
vertex). Consider again Figure 3. Since the problem is to find a
tour, we can use the dummy vertex, without loss of generality,
as the first and last vertex in the tour returned by MAX
TOUR-TSP, $\langle dummy, e_1, v_1, \ldots, v_m, e_{m+1}, dummy \rangle$.
We adopt the usual notation that record $r_i$ was bijectively
mapped to vertex $v_i$, with $i$ ranging from 1 to $m$.

The main observation is that, regardless of whether the tour
is optimal or not, any edge following or preceding the dummy
vertex will always have weight 0 by construction, and will not
contribute anything to the total tour weight. The difference (between the
optimal and sub-optimal tour weights) can only be due to the
path $\langle v_1, \ldots, v_m \rangle$. Let the same path in the optimal tour
be denoted as path*. If the approximation ratio of the TSP
algorithm is $\rho$, this implies that:

$$\frac{\sum_{e \in \mathrm{Edges}(path)} w(e)}{\sum_{e \in \mathrm{Edges}(path^*)} w(e)} \ge \rho \qquad (10)$$

The TourToList subroutine in Algorithm 3 converts this inner
path into the ordered list of records, as illustrated in Figure
3. Thus, the numerator in the ratio above is exactly
$\Phi[y]$ and the denominator is $\Phi^*[y]$. This proves the theorem.
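The role of the dummy vertex can be illustrated with a small sketch. The functions below are stand-ins written for this illustration, not the paper's RecordsToGraph and TourToList subroutines: the graph is assumed to be a dictionary from unordered vertex pairs to weights, the heuristic `f` is assumed symmetric, and an exhaustive search (`brute_force_max_tour`, a hypothetical name) replaces an actual MAX TOUR-TSP approximation algorithm, so it is usable only on tiny blocks.

```python
from itertools import permutations

DUMMY = "dummy"


def records_to_graph(records, f):
    """Complete weighted graph; every edge touching the dummy has weight 0."""
    weight = {}
    for idx, a in enumerate(records):
        weight[frozenset((DUMMY, a))] = 0
        for b in records[idx + 1:]:
            weight[frozenset((a, b))] = f(a, b)
    return weight


def brute_force_max_tour(vertices, weight):
    """Exhaustive stand-in for a MAX TOUR-TSP solver (tiny inputs only)."""
    first, rest = vertices[0], vertices[1:]
    best, best_w = None, float("-inf")
    for perm in permutations(rest):
        tour = (first, *perm)
        total = sum(weight[frozenset((tour[i], tour[(i + 1) % len(tour)]))]
                    for i in range(len(tour)))
        if total > best_w:
            best, best_w = tour, total
    return list(best), best_w


def tour_to_list(tour):
    """Cut the tour at the dummy vertex to obtain an ordered record list."""
    k = tour.index(DUMMY)
    return tour[k + 1:] + tour[:k]
```

Since both edges incident to the dummy vertex weigh 0, the weight of any tour equals the 2-score of the list obtained by cutting the tour at the dummy, which is the identity the proof relies on.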

D. Construction showing score-decline of greedy adjacent 
swapping 


17 Since, as we stated earlier, Rule 2 assigns only a specific entry in the
f-matrix to an eligible pair, whereas Rule 1 assigns the sum of all entries in
the f-matrix to an eligible pair. Recall that f was non-negative


We show a construction that proves that greedy adjacent
swapping (GAS), described in Section V-C, does not always
improve scores. The construction is shown in Figure 7. Although
we only show the construction for the forward pass, a
symmetric case can be constructed for the backward pass.



for each block in Figure 8. The correctness of the GAS
procedure itself can also be verified by using the given scores
(in Figure 8) between records belonging to different blocks
(inter-block scores in the figure). If we now perform merge
and compute the candidate set score for the two lists in Figure
7 (before and after the GAS procedure was run), $\Gamma^{orig}$ is
found to have score 6 + 7 + 6 + 1 + 3 = 23, where we
break up the score by each block's score (6, 6 and 3) and
the scores contributed by records straddling adjacent blocks
(7 and 1). Similarly, the candidate set score after GAS, $\Gamma^{GAS}$,
is computed as 5 + 0 + 1 + 4 + 3 = 13. The score has declined
by a considerable margin.

To understand why such a decline occurred, consider the
swap that took place in the second block. If, after the first swap
(of records 2 and 4 in Block 1), we had computed the
candidate set score, it would have increased, since the increase
in score would have been $f(2,5) = 9$ according to Figure 8
and the decline in the local 2-ordering score of Block 1 would
only be 6 - 5 = 1. The global decline occurs because the first
and last records in Block 2 end up getting swapped in the
next step, which is a valid step according to GAS semantics.
Because the procedure is greedy, it only takes into account
the impact that the swap has on the local score of the current
block. In other words, the ramifications on previous blocks are
ignored.

Given that such declines can occur, we are therefore justified
in comparing all three lists in line 7 of Algorithm 5.


18 As an additional detail, list polarities are also correct, which is required
by Algorithm 5. The correctness is evident since $f(4,5) > f(4,8)$ and
$f(8,9) > f(8,11)$


